This is an archive of the discontinued LLVM Phabricator instance.


Differential D134800

[mlir][transform] Create GPU transform dialect
ClosedPublic

Authored by guraypp on Sep 28 2022, 3:42 AM.


Details

Reviewers

nicolasvasilache
ftynse
bondhugula
ThomasRaoux
herhut

Commits

rG89bb0cae46f8: [mlir][transform] Create GPU transform dialect

Summary

This revision adds a GPU transform dialect. It also introduces the prefix "transform.gpu" for all ops related to this dialect.

MLIR already had two GPU transform ops in linalg. This revision moves these ops into GPUTransformOps. The ops are as follows:

transform.structured.map_nested_foreach_thread_to_gpu_blocks -> transform.gpu.map_foreach_to_blocks
This op selects the outermost (top-level) foreach_thread and parallelizes it across GPU blocks. It can also generate gpu_launch.

transform.structured.map_nested_foreach_thread_to_gpu_threads -> transform.gpu.map_nested_foreach_to_threads
This op parallelizes nested foreach_thread ops that are inside gpu_launch across GPU threads.

It doesn't add new functionality, but there is some minor refactoring of the code.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

guraypp created this revision.Sep 28 2022, 3:42 AM

Herald added a reviewer: bondhugula. Sep 28 2022, 3:42 AM

Herald added a reviewer: ThomasRaoux.

Herald added a project: Restricted Project.

Herald added subscribers: zero9178, bzcheeseman, sdasgup3 and 21 others.

guraypp requested review of this revision. Sep 28 2022, 3:42 AM

Herald added a reviewer: herhut. Sep 28 2022, 3:42 AM

Herald added a project: Restricted Project.

Herald added subscribers: limo1996, stephenneuendorffer.

include new files

Harbormaster completed remote builds in B189131: Diff 463492.Sep 28 2022, 4:01 AM

rebase

guraypp retitled this revision from [mlir][transform] Create GPU dialect transform to [mlir][transform] Create GPU transform dialect.Sep 28 2022, 5:27 AM

guraypp edited the summary of this revision.

guraypp edited the summary of this revision. Sep 28 2022, 5:37 AM

Harbormaster completed remote builds in B189144: Diff 463511. Sep 28 2022, 5:45 AM

guraypp edited the summary of this revision. Sep 28 2022, 6:05 AM

Thanks! I see this is mostly code motion, but it doesn't look like the original commit was reviewed in detail. Please note that stylistic comments (expanding auto, use of braces, comments) apply to the entire patch and not only where noted.

mlir/include/mlir/Dialect/GPU/TransformOps/GPUTransformOps.h
37
40
42
46	Pass `ArrayRef<int64_t>` if you are not intending to modify the content. (And `SmallVectorImpl &` if you are).
58	`SmallVectorImpl` if it's being pushed into, `MutableArrayRef` otherwise. Just `SmallVector` cannot be used with `SmallVector<?, 42>` or any specific stack-element size.
mlir/include/mlir/Dialect/GPU/TransformOps/GPUTransformOps.td
2	Nit: extend to 80 cols
27
28
30
33
38	I don't see how functions get involved here? This targets `gpu.launch` that has a region AFAIK.
43
46	Please wrap any IR name containing underscores into backticks systematically. Otherwise, it screws up formatting in the generated markdown documentation.
55–56	Please, please, please don't use `=====` in markdown. This is the page title and it breaks a lot of things including tables of contents, page headers and search terms.
mlir/lib/Dialect/GPU/CMakeLists.txt
85–102	Could we rather have a separate CMakeLists.txt inside TransformOps?
mlir/lib/Dialect/GPU/TransformOps/GPUTransformOps.cpp
15	Any reason we need PDL and not just PDLTypes?
35	Please document top-level functions.

ftynse added inline comments.Sep 28 2022, 8:14 AM

mlir/lib/Dialect/GPU/TransformOps/GPUTransformOps.cpp
63	Nit: end comments with a full stop. Here and below.
87	Nit: it may be a good idea to create an InsertionGuard at the entry of the function so the rewriter is reset back to where it was on return.
103–105	Nit FYI: you can also do return op->emitError(), diagnostics are convertible to failure().
150	Please expand `auto` unless the type is clear from local context (there's a cast on the RHS) or difficult to spell (iterators, lambdas).
154–155	Nit: this is a "non-trivial" conditional because of a multi-line condition and therefore needs braces around the body.
176	Nit: we have C++17, `for (auto [threadIdx, blockOp] : llvm::zip(...))` should work here.
181–182	There's no point in jumping through the `updateRootInPlace` hoops if you are calling splice above. This function cannot be used with dialect conversion anyway.
196–197	Note that this will advance over the "foreach(target-op(foreach()))" construct where the inner foreach is top-level within the given target but has _some_ other foreach ancestor. This should probably check that the foreach ancestor of another foreach has target as ancestor.
199
212
213
238	IMO, this should be silenceable failure. We haven't irreversibly modified the IR yet.
241	Please expand auto.
243	Can we rather return a DiagnosedSilenceableFailure from findTopLevelForeachThreadOp instead of LogicalResult? Unknown errors are almost useless.
253–254	Same as above.
260	Use `cast` if the result is never checked for being non-null.
292	Nit: there is no `gpu.thread` operation, did you mean `thread_id`?
331	Nit: creating IR inside another IR-creation call is highly discouraged. This specifically is not problematic because there will be no ordering issue given there's only one nested operation, but some later modification may add a second one leading to unspecified order in some compilers.
365	Same comment as above about RAUW and splice.
508	Please add the newline.

This revision now requires changes to proceed.Sep 28 2022, 8:14 AM

Addressed ftynse comments

@ftynse thank you so much for reviewing this one! I addressed your comments.

Pass SmallVectorImpl instead of SmallVector

This seems to include some irrelevant changes.

mlir/include/mlir/Dialect/Linalg/Transforms/Transforms.h
1340–1355	This change looks unrelated. Rebase went wrong?
1342	Nit: `///` here too, `//` inside functions only.
1353	Nit: this can be removed now.
mlir/include/mlir/InitAllDialects.h
117	Looks unrelated.
mlir/lib/Dialect/GPU/TransformOps/GPUTransformOps.cpp
83–84	Nit: `DiagnosedSilenceableFailure(diag.checkAndReport())` is shorter and will not drop the notes from the diagnostic. We can also add some `makeFailureDefinite()` that turns a silenceable failure into a definite one and propagates success.
84	On second thought, why is this a definite failure? We usually use definite failures when the transformation could have modified the IR in an irrecoverable way. Definite failures are bubbled up really fast and stop the entire transformation process. This looks more like a failed precondition, and we can attempt something else. IMO, this should be just simple propagation: `if (!diag.succeeded()) return diag;`. I suppose that you have a longer transform combo in mind, something like "create gpu launch, then do something with it", but that part can also propagate silenceable failures. In the transform dialect context, if nothing explicitly suppresses the error (like `transform.alternatives` or `sequence failures(suppress)`), it will ultimately be reported. Here, by emitting a definite failure because of a precondition, you make it impossible to build an `alternatives` that tries to map to GPU and falls back to staying on CPU otherwise, for example.
118–119	Ditto.
196–197	This doesn't seem fixed.
223–226	Ditto.
256–259	`DiagnosedSilenceableFailure diag = emitSilenceableError() << "message"; diag.attachNote(target->getLoc()) << "when applied to this payload op"; return diag;`

Please add tests for at least some of the user-visible error messages (those coming from emitError / silenceable diagnostic).

rebase

Harbormaster completed remote builds in B189387: Diff 463862.Sep 29 2022, 6:35 AM

update more definitive errors into silenceable failure

ftynse added inline comments.Sep 29 2022, 8:02 AM

mlir/lib/Dialect/GPU/TransformOps/GPUTransformOps.cpp
438	Leftover debug

Harbormaster completed remote builds in B189413: Diff 463897.Sep 29 2022, 8:53 AM

guraypp added inline comments.Sep 29 2022, 9:08 AM

mlir/lib/Dialect/GPU/TransformOps/GPUTransformOps.cpp
15	I removed PDLTypes. It looks like I need PDL rather than PDLTypes because I add it as a dependent dialect: `declareDependentDialect<pdl::PDLDialect>();`
181–182	I am not sure I understood this one. What else can I use instead of `updateRootInPlace`?
196–197	Good point! I noticed that it has many missing checks and this is one of them. For the case of "gpu::LaunchOp(foreach(target-op(foreach())))", there could even be a gpu::LaunchOp before it. It does not check that either. I will handle these checks in another PR with some tests, if that's okay?
331	Could you please elaborate on this comment?

remove leftover debug

ftynse added inline comments.Sep 29 2022, 9:51 AM

mlir/lib/Dialect/GPU/TransformOps/GPUTransformOps.cpp
181–182	Just use replaceAllUsesWith directly, without wrapping into updateRootInPlace.
196–197	Ok.
331	If you do something like `b.create<MulFOp>(loc, b.create<AddFOp>(...), b.create<AddFOp>(...))`, the order in which the "add" operations will be created is unspecified because the order of function argument evaluation is unspecified and does vary between compilers/platforms. Meaning we cannot reliably test the created IR since we don't know the order of operations in it. This is not a problem if only _one_ of the operands creates additional IR, which is the case here. But it makes it easier to accidentally add a second such call during refactoring, especially if it is hidden in a function call. So we decided at some point that it is safer to _never_ call "create" within the arguments of another "create", not even once. Avoiding such nested create also makes the C++ code slightly more similar to the IR produced: there's one statement per produced operation.

Harbormaster completed remote builds in B189442: Diff 463932.Sep 29 2022, 10:25 AM

Remove updateRootInPlace.
Change nested IR generation.

guraypp marked 18 inline comments as done.Sep 30 2022, 1:20 AM

guraypp added inline comments.

mlir/lib/Dialect/GPU/TransformOps/GPUTransformOps.cpp
84	I return everything as a "silenceable failure" before modifying the IR. Returning a "definite failure" is useless given the several precondition checks. The user can specify "sequence failures(propagate)" if they want failures. Thanks for explaining this mechanism. My misunderstanding arose because a "silenceable failure" does not show up with "applyToOne", as I demonstrated at https://reviews.llvm.org/D134886. Returning a "silenceable failure" has the drawback that the results still need to be assigned appropriately. If the results are not set, the transform dialect itself fails because the expected number of results is not set.
331	Thanks for the explanation, now I understand what you mean. Yes, the order of evaluating the arguments is unspecified. I ran an experiment at the link below: gcc and clang on x86 generate the operations in different orders. I am putting it here in case someone wonders in the future :) https://godbolt.org/z/TYf3cqxx5

Harbormaster completed remote builds in B189611: Diff 464171.Sep 30 2022, 1:20 AM

You probably need to rebase on top of the error propagation fix.

mlir/lib/Dialect/GPU/TransformOps/GPUTransformOps.cpp
84	The issues with failure propagation should be fixed after https://reviews.llvm.org/D134948 lands. For results, we can add/improve a helper that associates all transform op results (getNumResults) with an empty list of payload IR handles.

This revision is now accepted and ready to land.Sep 30 2022, 4:22 AM

Update prefixed accessors (https://github.com/llvm/llvm-project/commit/10c04f464138012f0930882465eff90b74d8fd1d)
Add bazel build
Rebase

Harbormaster completed remote builds in B189933: Diff 464613.Oct 3 2022, 1:15 AM

minor fixes wrt error reporting

guraypp added a child revision: D135063: [mlir][transform] Add failing test for GPU transform dialect.Oct 3 2022, 1:52 AM

Harbormaster completed remote builds in B189937: Diff 464622.Oct 3 2022, 2:20 AM

fix bazel build

Harbormaster completed remote builds in B189951: Diff 464638.Oct 3 2022, 3:13 AM

Made transformOp optional (llvm::Optional<TransformOpInterface> transformOp) for the mapNestedForeachToThreadsImp function. This function needs to be usable outside of the transform dialect, where there is no TransformOpInterface.

Harbormaster completed remote builds in B189967: Diff 464664.Oct 3 2022, 6:48 AM

rebase.

Harbormaster completed remote builds in B189994: Diff 464701.Oct 3 2022, 9:07 AM

rebase

Harbormaster completed remote builds in B190125: Diff 464898.Oct 4 2022, 12:30 AM

rebase

Harbormaster completed remote builds in B190147: Diff 464929.Oct 4 2022, 2:46 AM

rebase

This revision was landed with ongoing or failed builds.Oct 4 2022, 4:09 AM

Closed by commit rG89bb0cae46f8: [mlir][transform] Create GPU transform dialect (authored by guraypp).

This revision was automatically updated to reflect the committed changes.

guraypp added a commit: rG89bb0cae46f8: [mlir][transform] Create GPU transform dialect.

Harbormaster completed remote builds in B190159: Diff 464946.Oct 4 2022, 4:24 AM

Revision Contents

Path	Size
mlir/include/mlir/Dialect/GPU/CMakeLists.txt	1 line
mlir/include/mlir/Dialect/GPU/TransformOps/CMakeLists.txt	6 lines
mlir/include/mlir/Dialect/GPU/TransformOps/GPUTransformOps.h	75 lines
mlir/include/mlir/Dialect/GPU/TransformOps/GPUTransformOps.td	175 lines
mlir/include/mlir/Dialect/Linalg/TransformOps/LinalgTransformOps.td	155 lines
mlir/include/mlir/Dialect/Linalg/Transforms/Transforms.h	26 lines
mlir/include/mlir/InitAllDialects.h	2 lines
mlir/lib/Dialect/GPU/CMakeLists.txt	2 lines
mlir/lib/Dialect/GPU/TransformOps/CMakeLists.txt	18 lines
mlir/lib/Dialect/GPU/TransformOps/GPUTransformOps.cpp	507 lines
mlir/lib/Dialect/Linalg/TransformOps/LinalgTransformOps.cpp	386 lines
mlir/test/Dialect/GPU/transform-gpu.mlir	10 lines
mlir/test/Dialect/Linalg/transform-gpu.mlir
utils/bazel/llvm-project-overlay/mlir/BUILD.bazel	60 lines

Diff 464948

mlir/include/mlir/Dialect/GPU/CMakeLists.txt

  add_subdirectory(IR)
  add_subdirectory(Transforms)
+ add_subdirectory(TransformOps)

mlir/include/mlir/Dialect/GPU/TransformOps/CMakeLists.txt

This file was added.

				set(LLVM_TARGET_DEFINITIONS GPUTransformOps.td)
				mlir_tablegen(GPUTransformOps.h.inc -gen-op-decls)
				mlir_tablegen(GPUTransformOps.cpp.inc -gen-op-defs)
				add_public_tablegen_target(MLIRGPUTransformOpsIncGen)

				add_mlir_doc(GPUTransformOps GPUTransformOps Dialects/ -gen-op-doc)

mlir/include/mlir/Dialect/GPU/TransformOps/GPUTransformOps.h

This file was added.

//===- GPUTransformOps.h - GPU transform ops --------------------*- C++ -*-===//

// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.

// See https://llvm.org/LICENSE.txt for license information.

// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception

//===----------------------------------------------------------------------===//

#ifndef MLIR_DIALECT_GPU_TRANSFORMOPS_GPUTRANSFORMOPS_H

#define MLIR_DIALECT_GPU_TRANSFORMOPS_GPUTRANSFORMOPS_H

#include "mlir/Dialect/PDL/IR/PDLTypes.h"

#include "mlir/Dialect/SCF/IR/SCF.h"

#include "mlir/Dialect/Transform/IR/TransformInterfaces.h"

#include "mlir/IR/OpImplementation.h"

#include "mlir/IR/PatternMatch.h"

namespace mlir {

namespace gpu {

class GpuOp;

} // namespace gpu

} // namespace mlir

//===----------------------------------------------------------------------===//

// GPU Transform Operations

//===----------------------------------------------------------------------===//

#define GET_OP_CLASSES

#include "mlir/Dialect/GPU/TransformOps/GPUTransformOps.h.inc"

namespace mlir {

class DialectRegistry;

namespace transform {

namespace gpu {

/// Searches `scf.foreach_thread` ops nested under `target` and maps each such

/// op to GPU threads. Mapping is one-to-one and the induction variables of

ftynseUnsubmitted

Done

namespace gpu {

- /// Searches `scf.foreach_thread` ops nested under `target` and maps each such

+ /// Searches for `scf.foreach_thread` ops nested under `target` and maps each such

/// op to GPU threads. Mapping is one-to-one and the induction variables of

ftynse:

/// `scf.foreach_thread` are rewritten to gpu.thread_id according to the

/// thread_dim_apping attribute. Sibling `scf.foreach_thread` are supported in

/// which case, the union of the number of threads is computed and may result in

ftynseUnsubmitted

Done

/// `scf.foreach_thread` are rewritten to gpu.thread_id according to the

- /// thread_dim_apping attribute. Sibling `scf.foreach_thread` are supported in

+ /// thread_dim_mapping attribute. Sibling `scf.foreach_thread` are supported in

/// which case, the union of the number of threads is computed and may result in

ftynse:

/// predication. Dynamic, `scf.foreach_thread` trip counts are currently not

/// supported. Dynamic block dim sizes are currently not supported.

ftynseUnsubmitted

Done

/// which case, the union of the number of threads is computed and may result in

- /// predication. Dynamic, `scf.foreach_thread` trip counts are currently not

+ /// predication. Dynamic `scf.foreach_thread` trip counts are currently not

/// supported. Dynamic block dim sizes are currently not supported.

ftynse:

DiagnosedSilenceableFailure

mapNestedForeachToThreadsImp(RewriterBase &rewriter, Operation *target,

const SmallVectorImpl<int64_t> &blockDim,

bool syncAfterDistribute,

ftynseUnsubmitted

Done

Pass ArrayRef<int64_t> if you are not intending to modify the content. (And SmallVectorImpl & if you are).


llvm::Optional<TransformOpInterface> transformOp);

/// Maps the top level `scf.foreach_thread` op to GPU Thread Blocks. Mapping is

/// one-to-one and the induction variables of `scf.foreach_thread` are rewritten

/// to gpu.block_id according to the thread_dim_apping attribute. Dynamic,

/// `scf.foreach_thread` trip counts are currently not supported. Dynamic block

/// dim sizes are currently not supported.

DiagnosedSilenceableFailure mapForeachToBlocksImp(

RewriterBase &rewriter, scf::ForeachThreadOp foreachThreadOp,

function_ref<void(RewriterBase &, scf::ForeachThreadOp,

SmallVectorImpl<Value> &)>

blockIdGenerator,

ftynseUnsubmitted

Done

SmallVectorImpl if it's being pushed into, MutableArrayRef otherwise. Just SmallVector cannot be used with SmallVector<?, 42> or any specific stack-element size.


SmallVectorImpl<int64_t> &gridDims, TransformOpInterface transformOp);

/// Finds the top level scf::ForeachThreadOp of given target.

DiagnosedSilenceableFailure

findTopLevelForeachThreadOp(Operation *target,

scf::ForeachThreadOp &topLevelForeachThreadOp,

TransformOpInterface transformOp);

} // namespace gpu

} // namespace transform

namespace gpu {

void registerTransformDialectExtension(DialectRegistry &registry);

} // namespace gpu

} // namespace mlir

#endif // MLIR_DIALECT_GPU_TRANSFORMOPS_GPUTRANSFORMOPS_H

mlir/include/mlir/Dialect/GPU/TransformOps/GPUTransformOps.td

This file was added.

//===- GPUTransformOps.td - GPU transform ops --------------*- tablegen -*-===//

ftynseUnsubmitted

Done

Nit: extend to 80 cols


// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.

// See https://llvm.org/LICENSE.txt for license information.

// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception

//===----------------------------------------------------------------------===//

#ifndef GPU_TRANSFORM_OPS

#define GPU_TRANSFORM_OPS

include "mlir/Dialect/Transform/IR/TransformDialect.td"

include "mlir/Dialect/Transform/IR/TransformEffects.td"

include "mlir/Dialect/Transform/IR/TransformInterfaces.td"

include "mlir/Dialect/PDL/IR/PDLTypes.td"

include "mlir/Interfaces/SideEffectInterfaces.td"

include "mlir/IR/OpBase.td"

def MapNestedForeachToThreads :

Op<Transform_Dialect, "gpu.map_nested_foreach_to_threads",

[FunctionalStyleTransformOpTrait,

MemoryEffectsOpInterface,

TransformEachOpTrait,

TransformOpInterface]> {

let description = [{

Target the `gpu.launch op` and rewrite all `scf.foreach_thread`

nested in it to distributed `gpu.thread_id` attribute.

ftynseUnsubmitted

Done

let description = [{

- Target the `gpu_launch op` and rewrite all `scf.foreach_thread`

+ Target the `gpu.launch op` and rewrite all `scf.foreach_thread`

to distributed gpu.thread_id attribute.

ftynse:

ftynseUnsubmitted

Done

Target the `gpu_launch op` and rewrite all `scf.foreach_thread`

- to distributed gpu.thread_id attribute.

+ nested in it to distributed `gpu.thread_id` attribute.

The operation searches `scf.foreach_thread` ops nested under `target`

ftynse:

The operation searches for `scf.foreach_thread` ops nested under `target`

and maps each such op to GPU threads. Mapping is one-to-one and the

ftynseUnsubmitted

Done

to distributed gpu.thread_id attribute.

- The operation searches `scf.foreach_thread` ops nested under `target`

+ The operation searches for `scf.foreach_thread` ops nested under `target`

and maps each such op to GPU threads. Mapping is one-to-one and the

ftynse:

induction variables of `scf.foreach_thread` are rewritten to

`gpu.thread_id` according to the `thread_dim_mapping` attribute.

ftynseUnsubmitted

Done

induction variables of `scf.foreach_thread` are rewritten to

- gpu.thread_id according to the thread_dim_apping attribute.

+ `gpu.thread_id` according to the `thread_dim_mapping` attribute.

Sibling `scf.foreach_thread` are supported in which case, the union of

ftynse:

Sibling `scf.foreach_thread` are supported in which case, the union of

the number of threads is computed and may result in predication.

Multiple scf.foreach_thread are supported per `gpu.launch` in which case,

the max of all the threads is computed and taken for the global

ftynseUnsubmitted

Done

I don't see how functions get involved here? This targets gpu.launch that has a region AFAIK.


`gpu.thread_id`. If necessary, `scf.foreach_thread` that do not use the

whole thread range result in predicated computations.

Dynamic `scf.foreach_thread` trip counts are currently not supported.

Dynamic block dim sizes are currently not supported.

ftynseUnsubmitted

Done

result in predicated computations.

- Dynamic, `scf.foreach_thread` trip counts are currently not supported.

+ Dynamic `scf.foreach_thread` trip counts are currently not supported.

Dynamic block dim sizes are currently not supported.

ftynse:

Only **bufferized** `scf.foreach_thread` are currently supported.

Only `scf.foreach_thread` distributed to **at most 3 dimensions** are

ftynseUnsubmitted

Done

Dynamic block dim sizes are currently not supported.

- Only **bufferized** scf.foreach_thread are currently supported.

+ Only **bufferized** `scf.foreach_thread` are currently supported.

Only scf.foreach_thread distributed to **at most 3 dimensions** are

Please wrap any IR name containing underscores into backticks systematically. Otherwise, it screws up formatting in the generated markdown documentation.


currently supported.

Barriers are inserted after each scf.foreach_thread op for now.

The operation alters the block size of the given gpu_launch using

blockDim argument.

#### Return modes:

This operation ignores non-gpu_launch ops and drops them in the return.

ftynseUnsubmitted

Done

blockDim argument.

- Return modes:

- =============

+ #### Return modes:

This operation ignores non-gpu_launch ops and drops them in the return.

Please, please, please don't use ===== in markdown. This is the page title and it breaks a lot of things including tables of contents, page headers and search terms.


If any scf.foreach_thread with tensors is found, the transform definitely

fails.

If all the scf.foreach_thread operations contained within the LaunchOp

referred to by the `target` PDLOperation lower to GPU properly, the

transform succeeds. Otherwise the transform definitely fails.

The returned handle points to the same LaunchOp operand, consuming it and

producing a new SSA value to satisfy chaining and linearity of the IR

properties.

#### Example:

```

gpu.launch blocks(%bx, %by, %bz) in (%x = %0, %y = %1, %z = %2)

threads(%tx, %ty, %tz) in (%tx = %3, %ty = %4, %tz = %5) {

scf.foreach_thread (%i, %j) in (7, 9) {

... // body 1

} {thread_dim_mapping = [1, 0, 2]}

scf.foreach_thread (%i) in (12) {

... // body 2

}

gpu.terminator

}

```

is translated to:

```

%bdimX = arith.constant 12 : index

%bdimY = arith.constant 9 : index

gpu.launch blocks(%bx, %by, %bz) in (%x = %0, %y = %1, %z = %2)

threads(%tx, %ty, %tz) in (%tx = %bdimX, %ty = %bdimY, %tz = %5) {

if (threadIdx.x < 9 && threadIdx.y < 7) {

... // body 1

}

gpu.barrier

if (threadIdx.y < 1) {

... // body 2

}

gpu.barrier

gpu.terminator

}

```

}];

let arguments = (ins PDL_Operation:$target,

DefaultValuedAttr<I64ArrayAttr, "{}">:$blockDim,

DefaultValuedAttr<BoolAttr, "true">:$syncAfterDistribute);

let results = (outs PDL_Operation:$result);

let assemblyFormat = "$target attr-dict";

let extraClassDeclaration = [{

::mlir::DiagnosedSilenceableFailure applyToOne(

::mlir::Operation *target,

::llvm::SmallVectorImpl<::mlir::Operation *> &results,

::mlir::transform::TransformState &state);

}];

}

def MapForeachToBlocks :

Op<Transform_Dialect, "gpu.map_foreach_to_blocks",

[FunctionalStyleTransformOpTrait,

MemoryEffectsOpInterface,

TransformOpInterface,

TransformEachOpTrait]> {

let description = [{

Target the gpu_launch op and rewrite the top level `scf.foreach_thread`

to distributed gpu.block_id attribute. If `generate_gpu_launch` attribute

is set, then first generates `gpu_launch` and moves the top level

`scf.foreach_thread` inside.

The operation searches top level `scf.foreach_thread` ops under

`gpu_launch` and maps each such op to GPU blocks. Mapping is

one-to-one and the induction variables of `scf.foreach_thread` are

rewritten to gpu.block_id according to the `thread_dim_apping` attribute.

Dynamic, `scf.foreach_thread` trip counts are currently not supported.

Dynamic block dim sizes are currently not supported.

Only **bufferized** scf.foreach_thread are currently supported.

Only scf.foreach_thread distributed to **at most 3 dimensions** are

currently supported.

The operation alters the block size of the given gpu_launch using

gridDim argument.

#### Return modes:

This operation ignores non-gpu_launch ops and drops them in the return.

If any scf.foreach_thread with tensors is found, the transform definitely

fails.

If all the scf.foreach_thread operations contained within the LaunchOp

referred to by the `target` PDLOperation lower to GPU properly, the

transform succeeds. Otherwise the transform definitely fails.

The returned handle points to the same LaunchOp operand, consuming it and

producing a new SSA value to satisfy chaining and linearity of the IR

properties.

}];

let arguments = (ins PDL_Operation:$target,

DefaultValuedAttr<I64ArrayAttr, "{}">:$gridDim,

UnitAttr:$generate_gpu_launch);

let results = (outs PDL_Operation:$result);

let assemblyFormat = "$target attr-dict";

let extraClassDeclaration = [{

::mlir::DiagnosedSilenceableFailure applyToOne(

::mlir::Operation *target,

::llvm::SmallVectorImpl<::mlir::Operation *> &results,

::mlir::transform::TransformState &state);

}];

}

#endif // GPU_TRANSFORM_OPS

mlir/include/mlir/Dialect/Linalg/TransformOps/LinalgTransformOps.td

::mlir::DiagnosedSilenceableFailure apply(
::mlir::transform::TransformResults &transformResults,		::mlir::transform::TransformResults &transformResults,
::mlir::transform::TransformState &state);		::mlir::transform::TransformState &state);

::llvm::SmallVector<::mlir::OpFoldResult> getMixedNumThreads();		::llvm::SmallVector<::mlir::OpFoldResult> getMixedNumThreads();
::llvm::SmallVector<::mlir::OpFoldResult> getMixedTileSizes();		::llvm::SmallVector<::mlir::OpFoldResult> getMixedTileSizes();
}];		}];
}		}

def MapNestedForeachThreadToGpuThreads :
Op<Transform_Dialect, "structured.map_nested_foreach_thread_to_gpu_threads",
[FunctionalStyleTransformOpTrait,
MemoryEffectsOpInterface,
TransformEachOpTrait,
TransformOpInterface]> {
let description = [{
Target the gpu_launch op and rewrite all scf.foreach_thread
to distributed gpu.thread_id attribute.

The operation searches `scf.foreach_thread` ops nested under `target`
and maps each such op to GPU threads. Mapping is one-to-one and the
induction variables of `scf.foreach_thread` are rewritten to
gpu.thread_id according to the thread_dim_apping attribute.

Sibling `scf.foreach_thread` are supported in which case, the union of
the number of threads is computed and may result in predication.

Multiple scf.foreach_thread are supported per function in which case, the
max of all the threads is computed and taken for the global gpu.thread_id.
If necessary, scf.foreach_thread that do not use the whole thread range
result in predicated computations.

Dynamic, `scf.foreach_thread` trip counts are currently not supported.
Dynamic block dim sizes are currently not supported.

Only bufferized scf.foreach_thread are currently supported.
Only scf.foreach_thread distributed to at most 3 dimensions are
currently supported.

Barriers are inserted after each scf.foreach_thread op for now.

The operation alters the block size of the given gpu_launch using
blockDim argument.

#### Return modes:

This operation ignores non-gpu_launch ops and drops them in the return.

If any scf.foreach_thread with tensors is found, the transform definitely
fails.

If all the scf.foreach_thread operations contained within the LaunchOp
referred to by the `target` PDLOperation lower to GPU properly, the
transform succeeds. Otherwise the transform definitely fails.

The returned handle points to the same LaunchOp operand, consuming it and
producing a new SSA value to satisfy chaining and linearity of the IR
properties.

#### Example:

```
gpu.launch blocks(%bx, %by, %bz) in (%x = %0, %y = %1, %z = %2)
threads(%tx, %ty, %tz) in (%tx = %3, %ty = %4, %tz = %5) {
scf.foreach_thread (%i, %j) in (7, 9) {
... // body 1
} {thread_dim_mapping = [1, 0, 2]}
scf.foreach_thread (%i) in (12) {
... // body 2
}
gpu.terminator
}
```
is translated to:

```
%bdimX = arith.constant 12 : index
%bdimY = arith.constant 9 : index
gpu.launch blocks(%bx, %by, %bz) in (%x = %0, %y = %1, %z = %2)
threads(%tx, %ty, %tz) in (%tx = %bdimX, %ty = %bdimY, %tz = %5) {
if (threadIdx.x < 9 && threadIdx.y < 7) {
... // body 1
}
gpu.barrier
if (threadIdx.y < 1) {
... // body 2
}
gpu.barrier
gpu.terminator
}
```
}];

let arguments = (ins PDL_Operation:$target,
DefaultValuedAttr<I64ArrayAttr, "{}">:$blockDim,
DefaultValuedAttr<BoolAttr, "true">:$syncAfterDistribute);
let results = (outs PDL_Operation:$result);

let assemblyFormat = "$target attr-dict";
let extraClassDeclaration = [{
::mlir::DiagnosedSilenceableFailure applyToOne(
::mlir::Operation *target,
::llvm::SmallVectorImpl<::mlir::Operation *> &results,
::mlir::transform::TransformState &state);
}];
}
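
The predication in the thread-mapping example above can be reproduced with a small standalone sketch. It is only an illustration under stated assumptions, not the code of this revision: `permuteExtents` and `buildPredicate` are hypothetical helpers, the block size is assumed to be `[12, 9, 1]` (the example only shows `%bdimX` and `%bdimY`), and a loop without `thread_dim_mapping` is assumed to use the identity mapping `[0, 1, 2]`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: places each foreach trip count on the GPU dimension
// named by `mapping` (0 = x, 1 = y, 2 = z). Unmapped dimensions keep extent 1.
std::array<int64_t, 3> permuteExtents(const std::vector<int64_t> &tripCounts,
                                      const std::vector<int64_t> &mapping) {
  std::array<int64_t, 3> extents = {1, 1, 1};
  for (std::size_t i = 0; i < tripCounts.size(); ++i)
    extents[static_cast<std::size_t>(mapping[i])] = tripCounts[i];
  return extents;
}

// Hypothetical helper: builds the guard for one foreach. A dimension is
// predicated only when the loop uses fewer threads than the block provides.
std::string buildPredicate(const std::array<int64_t, 3> &extents,
                           const std::array<int64_t, 3> &blockDim) {
  static const char *names[3] = {"threadIdx.x", "threadIdx.y", "threadIdx.z"};
  std::string predicate;
  for (std::size_t d = 0; d < 3; ++d) {
    if (extents[d] >= blockDim[d])
      continue; // The loop covers the whole range, so no guard is needed.
    if (!predicate.empty())
      predicate += " && ";
    predicate += std::string(names[d]) + " < " + std::to_string(extents[d]);
  }
  return predicate;
}
```

With the assumed block size, the first loop `(7, 9)` mapped by `[1, 0, 2]` occupies 9 threads along x and 7 along y, and the second loop `(12)` occupies the full x range but only one thread along y, which matches the two guards in the translated IR.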

def MapNestedForeachThreadToGpuBlocks : Op<Transform_Dialect,
"structured.map_nested_foreach_thread_to_gpu_blocks",
[FunctionalStyleTransformOpTrait,
MemoryEffectsOpInterface,
TransformOpInterface,
TransformEachOpTrait]> {
let description = [{
Target the gpu_launch op and rewrite the top level `scf.foreach_thread`
to distributed gpu.block_id attribute. If `generate_gpu_launch` attribute
is set, then first generates `gpu_launch` and moves the top level
`scf.foreach_thread` inside.

The operation searches top level `scf.foreach_thread` ops under
`gpu_launch` and maps each such op to GPU blocks. Mapping is
one-to-one and the induction variables of `scf.foreach_thread` are
rewritten to gpu.block_id according to the `thread_dim_mapping` attribute.

Dynamic `scf.foreach_thread` trip counts are currently not supported.
Dynamic block dim sizes are currently not supported.

Only bufferized scf.foreach_thread are currently supported.
Only scf.foreach_thread distributed to at most 3 dimensions are
currently supported.

The operation alters the grid size of the given gpu_launch using the
gridDim argument.

#### Return modes:

This operation ignores non-gpu_launch ops and drops them in the return.

If any scf.foreach_thread with tensors is found, the transform definitely
fails.

If all the scf.foreach_thread operations contained within the LaunchOp
referred to by the `target` PDLOperation lower to GPU properly, the
transform succeeds. Otherwise the transform definitely fails.

The returned handle points to the same LaunchOp operand, consuming it and
producing a new SSA value to satisfy chaining and linearity of the IR
properties.
}];

let arguments = (ins PDL_Operation:$target,
DefaultValuedAttr<I64ArrayAttr, "{}">:$gridDim,
UnitAttr:$generate_gpu_launch);
let results = (outs PDL_Operation:$result);

let assemblyFormat = "$target attr-dict";
let extraClassDeclaration = [{
::mlir::DiagnosedSilenceableFailure applyToOne(
::mlir::Operation *target,
::llvm::SmallVectorImpl<::mlir::Operation *> &results,
::mlir::transform::TransformState &state);
}];
}

def VectorizeOp : Op<Transform_Dialect, "structured.vectorize",
[FunctionalStyleTransformOpTrait, MemoryEffectsOpInterface,
TransformEachOpTrait, TransformOpInterface]> {
let description = [{
Indicates that the given `target` op all the ops it contains should be
vectorized with the configuration specified by the attributes of this op.
This vectorization only handles structured ops that operate on shaped types
and does not vectorize loops or straight-line. Internally, it applies a
▲ Show 20 Lines • Show All 47 Lines • Show Last 20 Lines

mlir/include/mlir/Dialect/Linalg/Transforms/Transforms.h

Show First 20 Lines • Show All 119 Lines • ▼ Show 20 Lines
bool areElementwiseOpsFusable(OpOperand *fusedOperand);

/// Fuse two `linalg.generic` operations that have a producer-consumer
/// relationship captured through `fusedOperand`. The method expects
/// that `areElementwiseOpsFusable` returns true for the given `fusedOperand`.
FailureOr<Operation *> fuseElementwiseOps(RewriterBase &rewriter,
OpOperand *fusedOperand);

/// Maps the top level `scf.foreach_thread` op to GPU Thread Blocks. Mapping is
/// one-to-one and the induction variables of `scf.foreach_thread` are rewritten
/// to gpu.block_id according to the thread_dim_mapping attribute. Dynamic
/// `scf.foreach_thread` trip counts are currently not supported. Dynamic block
/// dim sizes are currently not supported.
LogicalResult rewriteTopLevelForeachThreadToGpuBlocks(
RewriterBase &rewriter, scf::ForeachThreadOp foreachThreadOp,
function_ref<void(RewriterBase &, scf::ForeachThreadOp,
SmallVector<Value> &)>
blockIdGenerator,
SmallVector<int64_t> &gridDims);

/// Finds the top level scf::ForeachThreadOp of given target.
FailureOr<scf::ForeachThreadOp> findTopLevelForeachThreadOp(Operation *target);

/// Searches `scf.foreach_thread` ops nested under `target` and maps each such
/// op to GPU threads. Mapping is one-to-one and the induction variables of
/// `scf.foreach_thread` are rewritten to gpu.thread_id according to the
/// thread_dim_mapping attribute. Sibling `scf.foreach_thread` are supported in
/// which case, the union of the number of threads is computed and may result in
/// predication. Dynamic `scf.foreach_thread` trip counts are currently not
/// supported. Dynamic block dim sizes are currently not supported.
mlir::WalkResult rewriteMapNestedForeachThreadToGpuThreads(
RewriterBase &rewriter, Operation *target,
const SmallVector<int64_t> &blockDim, bool syncAfterDistribute);

/// Split the given `op` into two parts along the given iteration space
/// `dimension` at the specified `splitPoint`, and return the two parts.
///
/// For example, the following op:
///
/// linalg.matmul ins(%0, %1 : tensor<128x32xf32>, tensor<32x64xf32>)
/// outs(%2 : tensor<128x64xf32>)
///
▲ Show 20 Lines • Show All 1,196 Lines • ▼ Show 20 Lines
static void insert(RewritePatternSet &patterns,
const LinalgTilingOptions &options,
const LinalgTransformationFilter &f) {
patterns.add<LinalgTilingPattern>(OpTy::getOperationName(),
patterns.getContext(), options, f);
TilingPatterns<OpTypes...>::insert(patterns, options, f);
}
};

/// Split Reduction options.
struct SplitReductionOptions {
// Ratio used to split the reduction dimension. If the ratio is <= 1, nothing
ftynse (Done): Nit: `///` here too, `//` inside functions only.
// will be done.
int64_t ratio = 0;
// Index where the extra dimension is added to the intermediate tensor shape.
unsigned index = 0;
// If the inner dimension after splitting is parallel or reduction.
bool innerParallel = false;
};

/// Function signature to control reduction splitting. This returns
/// `SplitReductionOptions`.
// TODO: don't use unsigned unless doing bit manipulation.
ftynse (Done): Nit: this can be removed now.
using ControlSplitReductionFn =
std::function<SplitReductionOptions(LinalgOp op)>;
ftynse (Done): This change looks unrelated. Rebase went wrong?

/// Patterns to apply `splitReduction` below.
void populateSplitReductionPattern(
RewritePatternSet &patterns,
const ControlSplitReductionFn &controlSplitReductionFn,
const LinalgTransformationFilter &f = LinalgTransformationFilter(),
bool useAlloc = false);

▲ Show 20 Lines • Show All 113 Lines • Show Last 20 Lines

mlir/include/mlir/InitAllDialects.h

Show All 25 Lines
#include "mlir/Dialect/Bufferization/TransformOps/BufferizationTransformOps.h"
#include "mlir/Dialect/Bufferization/Transforms/FuncBufferizableOpInterfaceImpl.h"
#include "mlir/Dialect/Complex/IR/Complex.h"
#include "mlir/Dialect/ControlFlow/IR/ControlFlow.h"
#include "mlir/Dialect/DLTI/DLTI.h"
#include "mlir/Dialect/EmitC/IR/EmitC.h"
#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/Dialect/GPU/IR/GPUDialect.h"
+ #include "mlir/Dialect/GPU/TransformOps/GPUTransformOps.h"
#include "mlir/Dialect/LLVMIR/LLVMDialect.h"
#include "mlir/Dialect/LLVMIR/NVVMDialect.h"
#include "mlir/Dialect/LLVMIR/ROCDLDialect.h"
#include "mlir/Dialect/Linalg/IR/Linalg.h"
#include "mlir/Dialect/Linalg/TransformOps/LinalgTransformOps.h"
#include "mlir/Dialect/Linalg/Transforms/BufferizableOpInterfaceImpl.h"
#include "mlir/Dialect/Linalg/Transforms/TilingInterfaceImpl.h"
#include "mlir/Dialect/MLProgram/IR/MLProgram.h"
▲ Show 20 Lines • Show All 66 Lines • ▼ Show 20 Lines
registry.insert<acc::OpenACCDialect,
transform::TransformDialect,
tosa::TosaDialect,
x86vector::X86VectorDialect>();
// clang-format on

// Register all dialect extensions.
bufferization::registerTransformDialectExtension(registry);
linalg::registerTransformDialectExtension(registry);
memref::registerTransformDialectExtension(registry);
ftynse (Not Done): Looks unrelated.
scf::registerTransformDialectExtension(registry);
+ gpu::registerTransformDialectExtension(registry);

// Register all external models.
arith::registerBufferizableOpInterfaceExternalModels(registry);
bufferization::func_ext::registerBufferizableOpInterfaceExternalModels(
registry);
linalg::registerBufferizableOpInterfaceExternalModels(registry);
linalg::registerTilingInterfaceExternalModels(registry);
scf::registerBufferizableOpInterfaceExternalModels(registry);
Show All 18 Lines

mlir/lib/Dialect/GPU/CMakeLists.txt

Show First 20 Lines • Show All 76 Lines • ▼ Show 20 Lines
add_mlir_dialect_library(MLIRGPUTransforms
MLIRLLVMToLLVMIRTranslation
MLIRMemRefDialect
MLIRPass
MLIRSCFDialect
MLIRSupport
MLIRTransformUtils
)

+ add_subdirectory(TransformOps)

if(MLIR_ENABLE_CUDA_RUNNER)
if(NOT MLIR_ENABLE_CUDA_CONVERSIONS)
message(SEND_ERROR
"Building mlir with cuda support requires the NVPTX backend")
endif()

# Configure CUDA language support. Using check_language first allows us to
# give a custom error message.
include(CheckLanguage)
check_language(CUDA)
if (CMAKE_CUDA_COMPILER)
enable_language(CUDA)
else()
message(SEND_ERROR
"Building mlir with cuda support requires a working CUDA install")
endif()
ftynse (Done): Could we rather have a separate CMakeLists.txt inside TransformOps?

# Enable gpu-to-cubin pass.
target_compile_definitions(obj.MLIRGPUTransforms
PRIVATE
MLIR_GPU_TO_CUBIN_PASS_ENABLE=1
)

# Add CUDA headers includes and the libcuda.so library.
Show All 34 Lines

mlir/lib/Dialect/GPU/TransformOps/CMakeLists.txt

This file was added.

				add_mlir_dialect_library(MLIRGPUTransformOps
				GPUTransformOps.cpp

				ADDITIONAL_HEADER_DIRS
				${MLIR_MAIN_INCLUDE_DIR}/mlir/Dialect/GPU/TransformOps

				DEPENDS
				MLIRGPUTransformOpsIncGen

				LINK_LIBS PUBLIC
				MLIRIR
				MLIRGPUTransforms
				MLIRParser
				MLIRPDLDialect
				MLIRSideEffectInterfaces
				MLIRTransformDialect
				MLIRGPUOps
				)

mlir/lib/Dialect/GPU/TransformOps/GPUTransformOps.cpp

This file was added.

//===- GPUTransformOps.cpp - Implementation of GPU transform ops ----------===//

// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.

// See https://llvm.org/LICENSE.txt for license information.

// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception

//===----------------------------------------------------------------------===//

#include "mlir/Dialect/GPU/TransformOps/GPUTransformOps.h"

#include "mlir/Dialect/Arith/IR/Arith.h"

#include "mlir/Dialect/GPU/IR/GPUDialect.h"

#include "mlir/Dialect/GPU/TransformOps/GPUTransformOps.h"

#include "mlir/Dialect/PDL/IR/PDL.h"

#include "mlir/Dialect/SCF/IR/SCF.h"

ftynseUnsubmitted

Done

Any reason we need PDL and not just PDLTypes?

gurayppAuthorUnsubmitted

Done

I removed PDLTypes. It looks like I need PDL, not PDLTypes, because I add it as a dependent dialect:

declareDependentDialect<pdl::PDLDialect>();


#include "mlir/Dialect/Transform/IR/TransformDialect.h"

#include "mlir/Dialect/Transform/IR/TransformInterfaces.h"

#include "mlir/IR/Diagnostics.h"

#include "mlir/IR/Value.h"

#include "llvm/ADT/None.h"

#include "llvm/ADT/Optional.h"

using namespace mlir;

using namespace mlir::gpu;

using namespace mlir::transform;

namespace {

/// A simple pattern rewriter that implements no special logic.

class SimpleRewriter : public PatternRewriter {

public:

SimpleRewriter(MLIRContext *context) : PatternRewriter(context) {}

};

} // namespace

/// Determines if the size of the kernel configuration is supported by the GPU

ftynseUnsubmitted

Done

Please document top-level functions.


/// architecture being used. It presently makes use of CUDA limitations, however

/// that aspect may be enhanced for other GPUs.

static DiagnosedSilenceableFailure

checkGpuLimits(TransformOpInterface transformOp, Optional<int64_t> gridDimX,

Optional<int64_t> gridDimY, Optional<int64_t> gridDimZ,

Optional<int64_t> blockDimX, Optional<int64_t> blockDimY,

Optional<int64_t> blockDimZ) {

static constexpr int max_total_blockdim = 1024;

static constexpr int max_blockdimx = 1024;

static constexpr int max_blockdimy = 1024;

static constexpr int max_blockdimz = 64;

static constexpr int max_total_griddim = 2147483647;

static constexpr int max_griddimx = 2147483647;

static constexpr int max_griddimy = 65535;

static constexpr int max_griddimz = 65535;

if ((blockDimX.value_or(1) * blockDimY.value_or(1) * blockDimZ.value_or(1)) >

max_total_blockdim ||

(gridDimX.value_or(1) * gridDimY.value_or(1) * gridDimZ.value_or(1)) >

max_total_griddim ||

blockDimX.value_or(1) > max_blockdimx ||

blockDimY.value_or(1) > max_blockdimy ||

blockDimZ.value_or(1) > max_blockdimz ||

gridDimY.value_or(1) > max_griddimy ||

gridDimZ.value_or(1) > max_griddimz ||

gridDimX.value_or(1) > max_griddimx) {

return transformOp.emitSilenceableError()

ftynseUnsubmitted

Done

Nit: end comments with a full stop. Here and below.


<< "Trying to launch a GPU kernel with gridDim = ("

<< gridDimX.value_or(1) << ", " << gridDimY.value_or(1) << ", "

<< gridDimZ.value_or(1) << ") blockDim = (" << blockDimX.value_or(1)

<< ", " << blockDimY.value_or(1) << ", " << blockDimZ.value_or(1)

<< "). It is larger than the limits.";

}

return DiagnosedSilenceableFailure::success();

}
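
For readers who want to try the limit check outside MLIR, here is a minimal standalone sketch. It assumes only the standard library: `GpuLimits`, `Dim`, and `fitsGpuLimits` are hypothetical names, the default values copy the CUDA-oriented constants from `checkGpuLimits` above, and the function returns a plain `bool` instead of a `DiagnosedSilenceableFailure`.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical container for the limits. The defaults copy the CUDA-oriented
// constants used by checkGpuLimits; other GPUs would need different values.
struct GpuLimits {
  int64_t maxTotalBlockDim = 1024;
  int64_t maxBlockDim[3] = {1024, 1024, 64};
  int64_t maxTotalGridDim = 2147483647;
  int64_t maxGridDim[3] = {2147483647, 65535, 65535};
};

using Dim = std::optional<int64_t>;

// Returns true when the launch configuration fits within `limits`.
// Unset dimensions default to 1, as in the original helper.
bool fitsGpuLimits(Dim gx, Dim gy, Dim gz, Dim bx, Dim by, Dim bz,
                   const GpuLimits &limits = GpuLimits()) {
  const int64_t grid[3] = {gx.value_or(1), gy.value_or(1), gz.value_or(1)};
  const int64_t block[3] = {bx.value_or(1), by.value_or(1), bz.value_or(1)};
  // Check each dimension first. With the default limits and non-negative
  // sizes, this keeps the products below within the range of int64_t.
  for (int d = 0; d < 3; ++d) {
    if (grid[d] > limits.maxGridDim[d] || block[d] > limits.maxBlockDim[d])
      return false;
  }
  return grid[0] * grid[1] * grid[2] <= limits.maxTotalGridDim &&
         block[0] * block[1] * block[2] <= limits.maxTotalBlockDim;
}
```

For example, `fitsGpuLimits(12, 9, 1, 32, 4, 1)` passes, whereas a block of `1024 x 2 x 1` is rejected because the total number of threads per block exceeds 1024 even though each dimension is within its own limit.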

/// Creates an empty-body gpu::LaunchOp using the provided kernel settings and

/// put a terminator within.

static DiagnosedSilenceableFailure

createGpuLaunch(RewriterBase &rewriter, Location loc,

TransformOpInterface transformOp, LaunchOp &launchOp,

Optional<int64_t> gridDimX = llvm::None,

Optional<int64_t> gridDimY = llvm::None,

Optional<int64_t> gridDimZ = llvm::None,

Optional<int64_t> blockDimX = llvm::None,

Optional<int64_t> blockDimY = llvm::None,

Optional<int64_t> blockDimZ = llvm::None) {

DiagnosedSilenceableFailure diag =

ftynseUnsubmitted

Done

Nit: `DiagnosedSilenceableFailure(diag.checkAndReport())` is shorter and will not drop the notes from the diagnostic. We can also add some `makeFailureDefinite()` that turns a silenceable failure into a definite one and propagates success.

ftynseUnsubmitted

Done

On second thought, why is this a definite failure? We usually use definite failures when the transformation could have modified the IR in an irrecoverable way. Definite failures are bubbled up really fast and stop the entire transformation process. This looks more like a failed precondition, and we can attempt something else. IMO, this should be just simple propagation: `if (!diag.succeeded()) return diag;`.

I suppose that you have a longer transform combo in mind, something like "create gpu launch, then do something with it", but that part can also propagate silenceable failures. In the transform dialect context, if nothing explicitly suppresses the error (like `transform.alternatives` or `sequence failures(suppress)`), it will ultimately be reported. Here, by emitting a definite failure because of a precondition, you make it impossible to build an `alternatives` that tries to map to GPU and falls back to staying on CPU otherwise, for example.

gurayppAuthorUnsubmitted

Done

I now return everything as a "silenceable failure" before modifying the IR. Returning a "definite failure" is useless given the several precondition checks. The user can specify "sequence failures(propagate)" if they want failures. Thanks for explaining this mechanism.

My misunderstanding arose because a "silenceable failure" does not show up with "applyToOne", as I demonstrated at https://reviews.llvm.org/D134886.

Returning a "silenceable failure" has the drawback that the results still need to be assigned appropriately. If the results are not set, the transform dialect itself fails because the expected number of results is not set.

ftynseUnsubmitted

Not Done

The issues with failure propagation should be fixed after https://reviews.llvm.org/D134948 lands.

For results, we can add/improve a helper that associates all transform op results (getNumResults) with an empty list of payload IR handles.
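
The distinction discussed in this thread can be illustrated with a toy model. The sketch below is not the MLIR API: `Outcome`, `Result`, `runAlternatives`, `mapToGpu`, and `stayOnCpu` are hypothetical names, and the 1024-thread limit is borrowed from `checkGpuLimits`. It only shows why a precondition reported as a silenceable failure lets an alternatives-style construct fall back, while a definite failure stops everything.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Toy model of the three outcomes of a transform; this is NOT the MLIR API.
enum class Outcome { Success, SilenceableFailure, DefiniteFailure };

struct Result {
  Outcome outcome;
  std::string message;
};

using Transform = std::function<Result()>;

// Tries each alternative in order. A silenceable failure is suppressed and the
// next alternative runs; success or a definite failure is returned at once.
Result runAlternatives(const std::vector<Transform> &alternatives) {
  Result last{Outcome::SilenceableFailure, "no alternative provided"};
  for (const Transform &alternative : alternatives) {
    last = alternative();
    if (last.outcome != Outcome::SilenceableFailure)
      return last;
  }
  return last;
}

// Hypothetical precondition check: it fails before touching any IR, so the
// failure is silenceable and the caller may try something else.
Result mapToGpu(long threadsPerBlock) {
  if (threadsPerBlock > 1024)
    return {Outcome::SilenceableFailure, "block size exceeds the limit"};
  return {Outcome::Success, "mapped to GPU"};
}

// Hypothetical fallback that always succeeds.
Result stayOnCpu() { return {Outcome::Success, "stayed on CPU"}; }
```

Running `runAlternatives` with `mapToGpu(4096)` followed by `stayOnCpu` ends in the CPU fallback. If the first alternative reported a definite failure instead, the fallback would never run, which is the situation the review comment warns against.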

checkGpuLimits(transformOp, gridDimX, gridDimY, gridDimZ, blockDimX,

blockDimY, blockDimZ);

if (!diag.succeeded())

ftynseUnsubmitted

Done

Nit: it may be a good idea to create an InsertionGuard at the entry of the function so the rewriter is reset back to where it was on return.

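
The guard idea from this comment can be shown with a toy RAII class. This is a sketch under stated assumptions rather than `OpBuilder::InsertionGuard` itself: `ToyBuilder`, `ToyInsertionGuard`, and `buildNested` are hypothetical, and the insertion point is reduced to an integer.

```cpp
#include <cassert>

// Toy stand-in for a builder that tracks where new ops are inserted.
struct ToyBuilder {
  int insertionPoint = 0;
};

// Saves the insertion point on construction and restores it on destruction,
// so every return path of the enclosing function resets the builder.
class ToyInsertionGuard {
public:
  explicit ToyInsertionGuard(ToyBuilder &b)
      : builder(b), saved(b.insertionPoint) {}
  ~ToyInsertionGuard() { builder.insertionPoint = saved; }
  ToyInsertionGuard(const ToyInsertionGuard &) = delete;
  ToyInsertionGuard &operator=(const ToyInsertionGuard &) = delete;

private:
  ToyBuilder &builder;
  int saved;
};

// Moves the insertion point while building, yet leaves the caller's position
// untouched whether it returns early or late.
bool buildNested(ToyBuilder &builder, bool failEarly) {
  ToyInsertionGuard guard(builder);
  builder.insertionPoint = 42;
  if (failEarly)
    return false;
  builder.insertionPoint = 43;
  return true;
}
```

Creating the guard at function entry means the early-return path and the normal path both restore the caller's insertion point without any explicit cleanup code.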

return diag;

auto createConst = [&](int dim) {

return rewriter.create<arith::ConstantIndexOp>(loc, dim);

};

OpBuilder::InsertionGuard guard(rewriter);

Value one = createConst(1);

Value gridSizeX = gridDimX.has_value() ? createConst(gridDimX.value()) : one;

Value gridSizeY = gridDimY.has_value() ? createConst(gridDimY.value()) : one;

Value gridSizeZ = gridDimZ.has_value() ? createConst(gridDimZ.value()) : one;

Value blkSizeX = blockDimX.has_value() ? createConst(blockDimX.value()) : one;

Value blkSizeY = blockDimY.has_value() ? createConst(blockDimY.value()) : one;

Value blkSizeZ = blockDimZ.has_value() ? createConst(blockDimZ.value()) : one;

launchOp = rewriter.create<LaunchOp>(loc, gridSizeX, gridSizeY, gridSizeZ,

blkSizeX, blkSizeY, blkSizeZ);

rewriter.setInsertionPointToEnd(&launchOp.getBody().front());

rewriter.create<TerminatorOp>(loc);

return DiagnosedSilenceableFailure(success());

ftynseUnsubmitted

Done

Nit FYI: you can also do return op->emitError(), diagnostics are convertible to failure().


}

/// Alter kernel configuration of the given kernel.

static DiagnosedSilenceableFailure

alterGpuLaunch(SimpleRewriter &rewriter, LaunchOp gpuLaunch,

TransformOpInterface transformOp,

Optional<int64_t> gridDimX = llvm::None,

Optional<int64_t> gridDimY = llvm::None,

Optional<int64_t> gridDimZ = llvm::None,

Optional<int64_t> blockDimX = llvm::None,

Optional<int64_t> blockDimY = llvm::None,

Optional<int64_t> blockDimZ = llvm::None) {

DiagnosedSilenceableFailure diag =

checkGpuLimits(transformOp, gridDimX, gridDimY, gridDimZ, blockDimX,

ftynseUnsubmitted

Done

Ditto.


blockDimY, blockDimZ);

if (!diag.succeeded())

return diag;

KernelDim3 currentBlockdim = gpuLaunch.getBlockSizeOperandValues();

OpBuilder::InsertionGuard guard(rewriter);

rewriter.setInsertionPointAfterValue(currentBlockdim.x);

auto createConstValue = [&](int dim) {

return rewriter.create<arith::ConstantIndexOp>(currentBlockdim.x.getLoc(),

dim);

};

if (gridDimX.has_value())

gpuLaunch.getGridSizeXMutable().assign(createConstValue(gridDimX.value()));

if (gridDimY.has_value())

gpuLaunch.getGridSizeYMutable().assign(createConstValue(gridDimY.value()));

if (gridDimZ.has_value())

gpuLaunch.getGridSizeZMutable().assign(createConstValue(gridDimZ.value()));

if (blockDimX.has_value())

gpuLaunch.getBlockSizeXMutable().assign(

createConstValue(blockDimX.value()));

if (blockDimY.has_value())

gpuLaunch.getBlockSizeYMutable().assign(

createConstValue(blockDimY.value()));

if (blockDimZ.has_value())

gpuLaunch.getBlockSizeZMutable().assign(

createConstValue(blockDimZ.value()));

return DiagnosedSilenceableFailure::success();

}

//===----------------------------------------------------------------------===//

ftynseUnsubmitted

Done

Please expand auto unless the type is clear from local context (there's a cast on the RHS) or difficult to spell (iterators, lambdas).


// MapForeachToBlocks

//===----------------------------------------------------------------------===//

DiagnosedSilenceableFailure mlir::transform::gpu::mapForeachToBlocksImp(

RewriterBase &rewriter, scf::ForeachThreadOp foreachThreadOp,

ftynseUnsubmitted

Done

Nit: this is a "non-trivial" conditional because of a multi-line condition and therefore needs braces around the body.


function_ref<void(RewriterBase &, scf::ForeachThreadOp,

SmallVectorImpl<Value> &)>

blockIdGenerator,

SmallVectorImpl<int64_t> &gridDims, TransformOpInterface transformOp) {

if (foreachThreadOp.getNumResults() > 0)

return transformOp.emitSilenceableError()

<< "only bufferized scf.foreach_thread lowers to gpu.block_id";

if (foreachThreadOp.getNumThreads().size() > 3)

return transformOp.emitSilenceableError()

<< "scf.foreach_thread with rank > 3 does not lower to gpu.block_id";

// Step 0. Outline the compute workload region and set up the workload

// operands.

FailureOr<SmallVector<OpFoldResult>> potentialGridDim =

foreachThreadOp.getPermutedNumThreads(rewriter);

if (failed(potentialGridDim) ||

llvm::any_of(*potentialGridDim, [](OpFoldResult ofr) {

return !getConstantIntValue(ofr).has_value();

})) {

return transformOp.emitSilenceableError() << "unsupported dynamic gridDim";

ftynseUnsubmitted

Done

Nit: we have C++17, for (auto [threadIdx, blockOp] : llvm::zip(...)) should work here.


}

for (OpFoldResult ofr : *potentialGridDim)

gridDims.push_back(getConstantIntValue(ofr).value());

SmallVector<Value> blockOps;

ftynseUnsubmitted

Done

There's no point in jumping through the updateRootInPlace hoops if you are calling splice above. This function cannot be used with dialect conversion anyway.


gurayppAuthorUnsubmitted

Done

I am not sure I understood this one. What else can I use instead of `updateRootInPlace`?

ftynseUnsubmitted

Done

Just use `replaceAllUsesWith` directly, without wrapping it into `updateRootInPlace`.

blockIdGenerator(rewriter, foreachThreadOp, blockOps);

// Step 1. Move the body of foreachThreadOp.

// Erase the terminator first, it will not be used since we are on buffers.

rewriter.eraseOp(foreachThreadOp.getTerminator());

Block *targetBlock = foreachThreadOp->getBlock();

Block::iterator insertionPoint = Block::iterator(foreachThreadOp);

Block &sourceBlock = foreachThreadOp.getRegion().front();

targetBlock->getOperations().splice(insertionPoint,

sourceBlock.getOperations());

// Step 2. RAUW thread indices to thread ops.

SmallVector<Value> threadIndices =

*foreachThreadOp.getPermutedThreadIndices();

assert(blockOps.size() == 3 && "3 block id ops are required");

ftynseUnsubmitted

Done

Note that this will advance over the "foreach(target-op(foreach()))" construct where the inner foreach is top-level within the given target but has _some_ other foreach ancestor. This should probably check that the foreach ancestor of another foreach has target as ancestor.


ftynseUnsubmitted

Done

This doesn't seem fixed.


gurayppAuthorUnsubmitted

Done

Good point! I noticed that many checks are missing, and this is one of them. For the case of "gpu::LaunchOp(foreach(target-op(foreach())))", there could even be a gpu::LaunchOp before it. It does not check that either.

I will handle these checks in another PR with some tests, if that's okay.

ftynseUnsubmitted

Done

Ok.

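
The concern raised in this thread can be made concrete with a toy operation tree. The sketch below is an assumption-laden illustration, not the fix that later landed: `ToyOp`, `hasAnyForeachAncestor`, and `hasForeachAncestorWithin` are hypothetical, and ops are reduced to a parent pointer plus a flag.

```cpp
#include <cassert>

// Toy operation node; `isForeach` marks scf.foreach_thread-like ops.
struct ToyOp {
  bool isForeach;
  const ToyOp *parent;
};

// Unbounded check, similar in spirit to asking for any enclosing foreach.
// It also sees foreach ops that enclose the target itself.
bool hasAnyForeachAncestor(const ToyOp *op) {
  for (const ToyOp *p = op->parent; p; p = p->parent)
    if (p->isForeach)
      return true;
  return false;
}

// Bounded check: only foreach ancestors located strictly inside `target`
// count, because the walk stops once it reaches `target`.
bool hasForeachAncestorWithin(const ToyOp *op, const ToyOp *target) {
  for (const ToyOp *p = op->parent; p && p != target; p = p->parent)
    if (p->isForeach)
      return true;
  return false;
}
```

In the "foreach(target-op(foreach()))" shape, the unbounded check reports an ancestor for the inner foreach and would skip it, while the bounded check correctly treats it as top-level within the target.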

for (auto [blockIdx, blockOp] : llvm::zip(threadIndices, blockOps)) {

Value val = blockIdx;

ftynseUnsubmitted

Done

if (topLevelForeachThreadOp)

- // TODO Handle multiple foreach if there is no dependences between them

+ // TODO: Handle multiple foreach if there is no dependences between them.

return WalkResult::interrupt();


Value blkOp = blockOp;

if (!val)

continue;

for (Operation *user : llvm::make_early_inc_range(val.getUsers()))

user->replaceUsesOfWith(val, blkOp);

}

// Step 3. Erase old op.

rewriter.eraseOp(foreachThreadOp);

return DiagnosedSilenceableFailure::success();

}

ftynseUnsubmitted

Done

return topLevelForeachThreadOp;

}

- /// This is an helper that is only used in

+ /// This is a helper that is only used in

/// rewriteTopLevelForeachThreadToGpuBlocks. It generates GPU dialects block_id


DiagnosedSilenceableFailure mlir::transform::gpu::findTopLevelForeachThreadOp(

ftynseUnsubmitted

Done

/// This is an helper that is only used in

- /// rewriteTopLevelForeachThreadToGpuBlocks. It generates GPU dialects block_id

+ /// rewriteTopLevelForeachThreadToGpuBlocks. It generates GPU dialects block_id.

static void generateGpuBlockIds(RewriterBase &rewriter,


Operation *target, scf::ForeachThreadOp &topLevelForeachThreadOp,

TransformOpInterface transformOp) {

auto walkResult = target->walk([&](scf::ForeachThreadOp foreachThreadOp) {

if (foreachThreadOp->getParentOfType<scf::ForeachThreadOp>())

return WalkResult::advance();

if (topLevelForeachThreadOp)

// TODO: Handle multiple foreach if there is no dependences between them

return WalkResult::interrupt();

topLevelForeachThreadOp = foreachThreadOp;

return WalkResult::advance();

});

if (walkResult.wasInterrupted())

ftynseUnsubmitted

Not Done

Ditto.


return transformOp.emitSilenceableError()

<< "could not find a unique topLevel scf.foreach_thread";

return DiagnosedSilenceableFailure::success();

}

/// This is a helper that is only used in

/// rewriteTopLevelForeachThreadToGpuBlocks. It generates GPU dialects block_id.

static void generateGpuBlockIds(RewriterBase &rewriter,

scf::ForeachThreadOp foreachOp,

SmallVectorImpl<Value> &blockOps) {

Location loc = foreachOp->getLoc();

OpBuilder::InsertionGuard guard(rewriter);

ftynseUnsubmitted

Done

IMO, this should be silenceable failure. We haven't irreversibly modified the IR yet.


rewriter.setInsertionPoint(foreachOp);

IndexType indexType = rewriter.getIndexType();

SmallVector<Dimension> gpuDims{Dimension::x, Dimension::y, Dimension::z};

ftynse (Unsubmitted)

Done

Please expand `auto`.

for (int64_t idx : llvm::seq<int64_t>(0, gpuDims.size())) {

blockOps.push_back(

ftynse (Unsubmitted)

Done

Can we rather return a `DiagnosedSilenceableFailure` from `findTopLevelForeachThreadOp` instead of `LogicalResult`? Unknown errors are almost useless.

rewriter.create<BlockIdOp>(loc, indexType, gpuDims[idx]));

}

DiagnosedSilenceableFailure

transform::MapForeachToBlocks::applyToOne(Operation *target,

SmallVectorImpl<Operation *> &results,

transform::TransformState &state) {

LaunchOp gpuLaunch = dyn_cast<LaunchOp>(target);

SimpleRewriter rewriter(getContext());

auto transformOp = cast<TransformOpInterface>(getOperation());

ftynse (Unsubmitted)

Done

Same as above.

if (!getGenerateGpuLaunch() && !gpuLaunch) {

results.assign({target});

DiagnosedSilenceableFailure diag =

emitSilenceableError()

ftynse (Unsubmitted)

Done

DiagnosedSilenceableFailure diag = emitSilenceableError() << "message";
diag.attachNote(target->getLoc()) << "when applied to this payload op";
return diag;

<< "Given target is not gpu.launch, set `generate_gpu_launch` "

ftynse (Unsubmitted)

Done

Use `cast` if the result is never checked for being non-null.

"attribute";

diag.attachNote(target->getLoc()) << "when applied to this payload op";

return diag;

}

scf::ForeachThreadOp topLevelForeachThreadOp;

DiagnosedSilenceableFailure diag =

mlir::transform::gpu::findTopLevelForeachThreadOp(

target, topLevelForeachThreadOp, transformOp);

if (!diag.succeeded()) {

results.assign({target});

diag.attachNote(target->getLoc()) << "when applied to this payload op";

return diag;

}

OpBuilder::InsertionGuard guard(rewriter);

rewriter.setInsertionPoint(topLevelForeachThreadOp);

// Generate gpu launch here and move the foreach_thread inside

if (getGenerateGpuLaunch()) {

DiagnosedSilenceableFailure diag =

createGpuLaunch(rewriter, target->getLoc(), transformOp, gpuLaunch);

if (!diag.succeeded()) {

results.assign({target});

return diag;

}

rewriter.setInsertionPointToStart(&gpuLaunch.getBody().front());

Operation *newForeachThreadOp = rewriter.clone(*topLevelForeachThreadOp);

rewriter.eraseOp(topLevelForeachThreadOp);

topLevelForeachThreadOp = cast<scf::ForeachThreadOp>(newForeachThreadOp);

}

ftynse (Unsubmitted)

Done

Nit: there is no `gpu.thread` operation, did you mean `thread_id`?

SmallVector<int64_t> gridDim = extractFromI64ArrayAttr(getGridDim());

diag = mlir::transform::gpu::mapForeachToBlocksImp(

rewriter, topLevelForeachThreadOp, generateGpuBlockIds, gridDim,

transformOp);

if (diag.succeeded()) {

diag = alterGpuLaunch(rewriter, gpuLaunch,

cast<TransformOpInterface>(getOperation()),

gridDim[0], gridDim[1], gridDim[2]);

}

results.assign({gpuLaunch});

return diag;

}
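
The review comments on this function ask for a silenceable failure whenever the payload IR has not yet been irreversibly modified. A minimal sketch of that decision follows, using a toy result type. `Outcome`, `ToyDiagnosed`, `makeFailure`, and `checkTarget` are hypothetical names for illustration only; they are not MLIR's `DiagnosedSilenceableFailure` API.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical three-state result, loosely modeled on the success /
// silenceable failure / definite failure distinction discussed in the review.
enum class Outcome { Success, SilenceableFailure, DefiniteFailure };

struct ToyDiagnosed {
  Outcome outcome = Outcome::Success;
  std::string message;

  bool succeeded() const { return outcome == Outcome::Success; }
  bool isSilenceableFailure() const {
    return outcome == Outcome::SilenceableFailure;
  }
};

// Picks the failure kind: while the payload is untouched the caller may still
// recover, so the failure is silenceable; after modification it is definite.
ToyDiagnosed makeFailure(bool payloadAlreadyModified, std::string message) {
  return {payloadAlreadyModified ? Outcome::DefiniteFailure
                                 : Outcome::SilenceableFailure,
          std::move(message)};
}

// Toy version of the precondition at the top of applyToOne: nothing has been
// rewritten yet, so a wrong target yields a silenceable failure.
ToyDiagnosed checkTarget(bool targetIsLaunch, bool generateLaunch) {
  if (!generateLaunch && !targetIsLaunch)
    return makeFailure(/*payloadAlreadyModified=*/false,
                       "Given target is not gpu.launch, set "
                       "`generate_gpu_launch` attribute");
  return {};
}
```

In this sketch the precondition check can only ever produce a silenceable failure, which matches the reviewer's reasoning that the IR is still intact at that point.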

//===----------------------------------------------------------------------===//

// MapNestedForeachToThreads

//===----------------------------------------------------------------------===//

/// Searches `scf.foreach_thread` ops nested under `target` and maps each such

/// op to GPU threads. Mapping is one-to-one and the induction variables of

/// `scf.foreach_thread` are rewritten to gpu.thread_id according to the

/// thread_dim_mapping attribute. Sibling `scf.foreach_thread` are supported in

/// which case, the union of the number of threads is computed and may result

/// in predication. Dynamic, `scf.foreach_thread` trip counts are currently

/// not supported. Dynamic block dim sizes are currently not supported.

static DiagnosedSilenceableFailure rewriteOneForeachThreadToGpuThreads(

RewriterBase &rewriter, scf::ForeachThreadOp foreachThreadOp,

const SmallVectorImpl<int64_t> &globalBlockDims, bool syncAfterDistribute,

llvm::Optional<TransformOpInterface> transformOp) {

auto failureHelper =

[&](const Twine &message) -> DiagnosedSilenceableFailure {

if (transformOp.has_value()) {

return transformOp->emitSilenceableError() << message;

}

foreachThreadOp->emitError() << message;

return DiagnosedSilenceableFailure::definiteFailure();

};

if (foreachThreadOp.getNumResults() > 0)

ftynse (Unsubmitted)

Done

Nit: creating IR inside another IR-creation call is highly discouraged. This specifically is not problematic because there will be no ordering issue given there's only one nested operation, but some later modification may add a second one leading to unspecified order in some compilers.

guraypp (Author, Unsubmitted)

Done

Could you please elaborate on this comment?

ftynse (Unsubmitted)

Done

If you do something like `b.create<MulFOp>(loc, b.create<AddFOp>(...), b.create<AddFOp>(...))`, the order in which the "add" operations will be created is unspecified, because the order of function argument evaluation is unspecified and does vary between compilers/platforms. This means we cannot reliably test the created IR, since we don't know the order of operations in it. This is not a problem if only _one_ of the operands creates additional IR, which is the case here. But it makes it easier to accidentally add a second such call during refactoring, especially if it is hidden in a function call. So we decided at some point that it is safer to _never_ call `create` within the arguments of another `create`, not even once. Avoiding such nested `create` calls also makes the C++ code slightly more similar to the IR produced: there's one statement per produced operation.

guraypp (Author, Unsubmitted)

Done

Thanks for the explanation, now I understand what you mean. Yes, the order in which the arguments are evaluated is unspecified. I ran an experiment at the link below: gcc and clang on x86 evaluate them in different orders. I'm putting it here in case someone wonders in the future :)
https://godbolt.org/z/TYf3cqxx5
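
The point made in this thread can be reproduced without MLIR. The sketch below uses a hypothetical `ToyBuilder` that only records the order in which "operations" are created; it is an illustration under that assumption, not the code behind the link above. In the nested form the two inner calls are function arguments, so their relative order is unspecified; hoisting them into separate statements fixes the order.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for an IR builder: it only logs creation order.
struct ToyBuilder {
  std::vector<std::string> created;

  // "Creates" a leaf operation and returns its index in the log.
  int create(const std::string &name) {
    created.push_back(name);
    return static_cast<int>(created.size()) - 1;
  }

  // "Creates" a binary operation; the operands are only there to force
  // evaluation of the argument expressions.
  int create(const std::string &name, int lhs, int rhs) {
    (void)lhs;
    (void)rhs;
    created.push_back(name);
    return static_cast<int>(created.size()) - 1;
  }
};

// Nested form: "add.lhs" and "add.rhs" may be logged in either order,
// depending on the compiler. Only "mul" is guaranteed to come last.
std::vector<std::string> buildNested() {
  ToyBuilder b;
  b.create("mul", b.create("add.lhs"), b.create("add.rhs"));
  return b.created;
}

// Hoisted form: one statement per created operation, so the order is fixed.
std::vector<std::string> buildHoisted() {
  ToyBuilder b;
  int lhs = b.create("add.lhs");
  int rhs = b.create("add.rhs");
  b.create("mul", lhs, rhs);
  return b.created;
}
```

Only `buildHoisted` has a log that a test can compare against a fixed expected sequence on every compiler.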


return failureHelper(

"only bufferized scf.foreach_thread lowers to gpu.thread_id");

if (foreachThreadOp.getNumThreads().size() > 3)

return failureHelper(

"scf.foreach_thread with rank > 3 does not lower to gpu.thread_id");

auto potentialBlockDim = foreachThreadOp.getPermutedNumThreads(rewriter);

if (failed(potentialBlockDim) ||

llvm::any_of(*potentialBlockDim, [](OpFoldResult ofr) {

return !getConstantIntValue(ofr).has_value();

})) {

return failureHelper("unsupported dynamic blockdim size");

}

SmallVector<int64_t> blockDim =

llvm::to_vector(llvm::map_range(*potentialBlockDim, [](OpFoldResult ofr) {

return getConstantIntValue(ofr).value();

}));

// Step 1. Create the gpu.thread ops

Location loc = foreachThreadOp.getLoc();

IndexType indexType = rewriter.getIndexType();

SmallVector<Dimension> gpuDims{Dimension::x, Dimension::y, Dimension::z};

SmallVector<Value> threadOps;

for (int64_t idx : llvm::seq<int64_t>(0, blockDim.size())) {

threadOps.push_back(

rewriter.create<ThreadIdOp>(loc, indexType, gpuDims[idx]));

}

// Step 2. Maybe create conditionals to predicate the region.

Value predicate;

for (auto [threadId, blockDim, globalBlockDim] :

llvm::zip(threadOps, blockDim, globalBlockDims)) {

ftynse (Unsubmitted)

Done

Same comment as above about RAUW and splice.

if (blockDim > globalBlockDim) {

return failureHelper(

"The GPU threads are fewer than the loop trip counts. "

"Try to tile scf.foreach_thread before mapping.");

}

if (blockDim == globalBlockDim)

continue;

Value blockIdx = rewriter.create<arith::ConstantIndexOp>(loc, blockDim);

Value tmpPredicate = rewriter.create<arith::CmpIOp>(

loc, arith::CmpIPredicate::ult, threadId, blockIdx);

predicate =

predicate ? rewriter.create<arith::AndIOp>(loc, predicate, tmpPredicate)

: tmpPredicate;

}

// Step 3. Move the body of foreachThreadOp.

// Erase the terminator first, it will not be used.

rewriter.eraseOp(foreachThreadOp.getTerminator());

Block *targetBlock;

Block::iterator insertionPoint;

if (predicate) {

// Step 3.a. If predicated, move at the beginning.

auto ifOp =

rewriter.create<scf::IfOp>(loc, predicate, /*withElseRegion=*/false);

targetBlock = ifOp.thenBlock();

insertionPoint = ifOp.thenBlock()->begin();

} else {

// Step 3.a. Otherwise, move inline just before foreachThreadOp.

targetBlock = foreachThreadOp->getBlock();

insertionPoint = Block::iterator(foreachThreadOp);

}

Block &sourceBlock = foreachThreadOp.getRegion().front();

targetBlock->getOperations().splice(insertionPoint,

sourceBlock.getOperations());

// Step 4. RAUW thread indices to thread ops.

SmallVector<Value> threadIndices =

*foreachThreadOp.getPermutedThreadIndices();

for (auto [threadIdx, threadOp] : llvm::zip(threadIndices, threadOps)) {

Value val = threadIdx;

Value op = threadOp;

if (!val)

continue;

for (Operation *user : llvm::make_early_inc_range(val.getUsers())) {

user->replaceUsesOfWith(val, op);

}

// Step 5. syncthreads.

// TODO: Need warpsync

if (syncAfterDistribute)

rewriter.create<BarrierOp>(loc);

// Step 6. Erase old op.

rewriter.eraseOp(foreachThreadOp);

return DiagnosedSilenceableFailure::success();

}
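
Step 2 of the function above decides, per dimension, whether the moved loop body must be guarded. A plain C++ sketch of the same rule, evaluated for a single thread, is given below. `Dim3` and `isThreadActive` are hypothetical names; the sketch uses a signed comparison where the source emits an unsigned `ult`, which gives the same result for the non-negative ids assumed here.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>

using Dim3 = std::array<int64_t, 3>;

// Returns std::nullopt if the loop needs more threads than the launch
// provides in some dimension (the "tile before mapping" error in the source).
// Otherwise returns whether the given thread executes the loop body.
std::optional<bool> isThreadActive(const Dim3 &threadId,
                                   const Dim3 &tripCounts,
                                   const Dim3 &launchBlockDims) {
  bool active = true;
  for (std::size_t i = 0; i < 3; ++i) {
    if (tripCounts[i] > launchBlockDims[i])
      return std::nullopt; // Fewer GPU threads than loop iterations.
    if (tripCounts[i] == launchBlockDims[i])
      continue; // Exact match: no predicate needed for this dimension.
    // Fewer iterations than threads: mask off the surplus threads.
    active = active && (threadId[i] < tripCounts[i]);
  }
  return active;
}
```

For example, with a launch block of 12 x 9 x 1 and a loop of 7 x 9 x 1 iterations, only the x dimension contributes a comparison, and threads with x id 7 through 11 skip the body.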

DiagnosedSilenceableFailure mlir::transform::gpu::mapNestedForeachToThreadsImp(

RewriterBase &rewriter, Operation *target,

const SmallVectorImpl<int64_t> &blockDim, bool syncAfterDistribute,

llvm::Optional<TransformOpInterface> transformOp) {

DiagnosedSilenceableFailure diag = DiagnosedSilenceableFailure::success();

target->walk([&](scf::ForeachThreadOp foreachThreadOp) {

rewriter.setInsertionPoint(foreachThreadOp);

diag = rewriteOneForeachThreadToGpuThreads(

rewriter, foreachThreadOp, blockDim, syncAfterDistribute, transformOp);

return diag.succeeded() ? WalkResult::advance() : WalkResult::interrupt();

});

return diag;

}

ftynse (Unsubmitted)

Done

Leftover debug

DiagnosedSilenceableFailure transform::MapNestedForeachToThreads::applyToOne(

::mlir::Operation *target,

::llvm::SmallVectorImpl<::mlir::Operation *> &results,

::mlir::transform::TransformState &state) {

LaunchOp gpuLaunch = dyn_cast<LaunchOp>(target);

auto transformOp = cast<TransformOpInterface>(getOperation());

if (!gpuLaunch) {

results.assign({target});

return emitSilenceableError() << "Given target is not gpu.launch";

}

SmallVector<int64_t> blockDim = extractFromI64ArrayAttr(getBlockDim());

blockDim.resize(/*size=*/3, /*value=*/1);

DiagnosedSilenceableFailure diag =

checkGpuLimits(transformOp, llvm::None, llvm::None, llvm::None,

blockDim[0], blockDim[1], blockDim[2]);

if (diag.isSilenceableFailure()) {

results.assign({target});

diag.attachNote(getLoc()) << getBlockDimAttrName() << " is very large";

return diag;

}

SimpleRewriter rewriter(getContext());

rewriter.setInsertionPoint(target);

diag = mlir::transform::gpu::mapNestedForeachToThreadsImp(

rewriter, target, blockDim, getSyncAfterDistribute(), llvm::None);

if (diag.succeeded()) {

diag =

alterGpuLaunch(rewriter, gpuLaunch, transformOp, llvm::None, llvm::None,

llvm::None, blockDim[0], blockDim[1], blockDim[2]);

}

results.assign({gpuLaunch});

return diag;

}

//===----------------------------------------------------------------------===//

// Transform op registration

//===----------------------------------------------------------------------===//

namespace {

/// Registers new ops and declares PDL as dependent dialect since the

/// additional ops are using PDL types for operands and results.

class GPUTransformDialectExtension

: public transform::TransformDialectExtension<

GPUTransformDialectExtension> {

public:

GPUTransformDialectExtension() {

declareDependentDialect<pdl::PDLDialect>();

declareGeneratedDialect<scf::SCFDialect>();

declareGeneratedDialect<arith::ArithDialect>();

declareGeneratedDialect<GPUDialect>();

registerTransformOps<

#define GET_OP_LIST

#include "mlir/Dialect/GPU/TransformOps/GPUTransformOps.cpp.inc"

>();

}

};

} // namespace

#define GET_OP_CLASSES

#include "mlir/Dialect/GPU/TransformOps/GPUTransformOps.cpp.inc"

void mlir::gpu::registerTransformDialectExtension(DialectRegistry &registry) {

registry.addExtensions<GPUTransformDialectExtension>();

}

ftynse (Unsubmitted)

Done

Please add the newline.

mlir/lib/Dialect/Linalg/TransformOps/LinalgTransformOps.cpp

[lines not shown]
void transform::TileOp::getEffects(
consumesHandle(getTarget(), effects);
onlyReadsHandle(getDynamicSizes(), effects);
producesHandle(getTiledLinalgOp(), effects);
producesHandle(getLoops(), effects);
modifiesPayload(effects);
}

//===----------------------------------------------------------------------===//
// MapNestedForeachThreadToGpuThreads
//===----------------------------------------------------------------------===//

/// Searches `scf.foreach_thread` ops nested under `target` and maps each such
/// op to GPU threads. Mapping is one-to-one and the induction variables of
/// `scf.foreach_thread` are rewritten to gpu.thread_id according to the
/// thread_dim_mapping attribute. Sibling `scf.foreach_thread` are supported in
/// which case, the union of the number of threads is computed and may result in
/// predication. Dynamic, `scf.foreach_thread` trip counts are currently not
/// supported. Dynamic block dim sizes are currently not supported.
static FailureOr<SmallVector<OpFoldResult>> rewriteOneForeachThreadToGpuThreads(
RewriterBase &rewriter, scf::ForeachThreadOp foreachThreadOp,
const SmallVector<int64_t> &globalBlockDims, bool syncAfterDistribute) {
if (foreachThreadOp.getNumResults() > 0)
return foreachThreadOp->emitError(
"only bufferized scf.foreach_thread lowers to gpu.thread");
if (foreachThreadOp.getNumThreads().size() > 3)
return foreachThreadOp->emitError(
"scf.foreach_thread with rank > 3 does not lower to gpu.thread");

auto potentialBlockDim = foreachThreadOp.getPermutedNumThreads(rewriter);
if (failed(potentialBlockDim) ||
llvm::any_of(*potentialBlockDim, [](OpFoldResult ofr) {
return !getConstantIntValue(ofr).has_value();
}))
return foreachThreadOp->emitError("unsupported dynamic blockdim size");

SmallVector<int64_t> blockDim =
llvm::to_vector(llvm::map_range(*potentialBlockDim, [](OpFoldResult ofr) {
return getConstantIntValue(ofr).value();
}));

// Step 1. Create the gpu.thread ops
Location loc = foreachThreadOp.getLoc();
IndexType indexType = rewriter.getIndexType();

SmallVector<gpu::Dimension> gpuDims{gpu::Dimension::x, gpu::Dimension::y,
gpu::Dimension::z};
SmallVector<Value> threadOps;
for (int64_t idx : llvm::seq<int64_t>(0, blockDim.size())) {
threadOps.push_back(
rewriter.create<gpu::ThreadIdOp>(loc, indexType, gpuDims[idx]));
}
// Step 2. Maybe create conditionals to predicate the region.
Value predicate;
for (auto [threadId, blockDim, globalBlockDim] :
llvm::zip(threadOps, blockDim, globalBlockDims)) {
if (blockDim > globalBlockDim) {
return foreachThreadOp.emitOpError("blockDim size overflow: ")
<< blockDim << " > " << globalBlockDim;
}
if (blockDim == globalBlockDim)
continue;
Value tmpPredicate = rewriter.create<arith::CmpIOp>(
loc, arith::CmpIPredicate::ult, threadId,
rewriter.create<arith::ConstantIndexOp>(loc, blockDim));
predicate =
predicate ? rewriter.create<arith::AndIOp>(loc, predicate, tmpPredicate)
: tmpPredicate;
}

// Step 3. Move the body of foreachThreadOp.
// Erase the terminator first, it will not be used.
rewriter.eraseOp(foreachThreadOp.getTerminator());
Block *targetBlock;
Block::iterator insertionPoint;
if (predicate) {
// Step 3.a. If predicated, move at the beginning.
auto ifOp =
rewriter.create<scf::IfOp>(loc, predicate, /*withElseRegion=*/false);
targetBlock = ifOp.thenBlock();
insertionPoint = ifOp.thenBlock()->begin();
} else {
// Step 3.a. Otherwise, move inline just before foreachThreadOp.
targetBlock = foreachThreadOp->getBlock();
insertionPoint = Block::iterator(foreachThreadOp);
}
Block &sourceBlock = foreachThreadOp.getRegion().front();
targetBlock->getOperations().splice(insertionPoint,
sourceBlock.getOperations());

// Step 4. RAUW thread indices to thread ops.
SmallVector<Value> threadIndices =
*foreachThreadOp.getPermutedThreadIndices();
for (auto it : llvm::zip(threadIndices, threadOps)) {
Value val = std::get<0>(it);
if (!val)
continue;
for (Operation *user : llvm::make_early_inc_range(val.getUsers())) {
rewriter.updateRootInPlace(
user, [&]() { user->replaceUsesOfWith(val, std::get<1>(it)); });
}
}

// Step 5. syncthreads.
// TODO: Need warpsync
if (syncAfterDistribute)
rewriter.create<gpu::BarrierOp>(loc);

// Step 6. Erase old op.
rewriter.eraseOp(foreachThreadOp);

return *potentialBlockDim;
}

mlir::WalkResult mlir::linalg::rewriteMapNestedForeachThreadToGpuThreads(
RewriterBase &rewriter, Operation *target,
const SmallVector<int64_t> &blockDim, bool syncAfterDistribute) {
auto walkResult = target->walk([&](scf::ForeachThreadOp foreachThreadOp) {
rewriter.setInsertionPoint(foreachThreadOp);
if (failed(rewriteOneForeachThreadToGpuThreads(
rewriter, foreachThreadOp, blockDim, syncAfterDistribute)))
return WalkResult::interrupt();
return WalkResult::advance();
});
return walkResult;
}

static LogicalResult
checkGpuLimits(Optional<int64_t> gridDimX, Optional<int64_t> gridDimY,
Optional<int64_t> gridDimZ, Optional<int64_t> blockDimX,
Optional<int64_t> blockDimY, Optional<int64_t> blockDimZ) {
// TODO The limits should live in the gpu dialect, but it's not like that
// right now. Read them in the common gpu dialect
if ((blockDimX.value_or(1) * blockDimY.value_or(1) * blockDimZ.value_or(1)) >
1024 ||
gridDimY.value_or(1) > 65535 || gridDimZ.value_or(1) > 65535 ||
gridDimX.value_or(1) > 2147483647)
return failure();
return success();
}
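
A standalone sketch of the same limit check follows, assuming `std::optional` in place of `llvm::Optional` and a hypothetical function name `withinGpuLimits`. The numeric limits are copied from the function above, whose TODO notes that they are hard-coded rather than read from the gpu dialect.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Returns true if the requested launch configuration fits the hard-coded
// limits used above. Dimensions that are not provided default to 1.
bool withinGpuLimits(std::optional<int64_t> gridDimX,
                     std::optional<int64_t> gridDimY,
                     std::optional<int64_t> gridDimZ,
                     std::optional<int64_t> blockDimX,
                     std::optional<int64_t> blockDimY,
                     std::optional<int64_t> blockDimZ) {
  // Total threads per block, not each block dimension on its own.
  const int64_t threadsPerBlock =
      blockDimX.value_or(1) * blockDimY.value_or(1) * blockDimZ.value_or(1);
  if (threadsPerBlock > 1024)
    return false;
  if (gridDimY.value_or(1) > 65535 || gridDimZ.value_or(1) > 65535)
    return false;
  if (gridDimX.value_or(1) > 2147483647)
    return false;
  return true;
}
```

With the block sizes used in the tests of this revision, `[12, 9, 1]` gives 108 threads per block and `[32, 4, 1]` gives 128, so both pass the check.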

/// Alter grid or block dimensions of the given kernel
static LogicalResult alterGpuLaunch(SimpleRewriter &rewriter,
gpu::LaunchOp gpuLaunch,
Optional<int64_t> gridDimX = llvm::None,
Optional<int64_t> gridDimY = llvm::None,
Optional<int64_t> gridDimZ = llvm::None,
Optional<int64_t> blockDimX = llvm::None,
Optional<int64_t> blockDimY = llvm::None,
Optional<int64_t> blockDimZ = llvm::None) {
if (failed(checkGpuLimits(gridDimX, gridDimY, gridDimZ, blockDimX, blockDimY,
blockDimZ))) {
gpuLaunch->emitError(
"Requested kernel thread configuration is larger than the limits");
return failure();
}

gpu::KernelDim3 currentBlockdim = gpuLaunch.getBlockSizeOperandValues();
OpBuilder::InsertionGuard guard(rewriter);
rewriter.setInsertionPointAfterValue(currentBlockdim.x);
auto createConstValue = [&](int dim) {
return rewriter.create<arith::ConstantIndexOp>(currentBlockdim.x.getLoc(),
dim);
};

if (gridDimX.has_value())
gpuLaunch.getGridSizeXMutable().assign(createConstValue(gridDimX.value()));
if (gridDimY.has_value())
gpuLaunch.getGridSizeYMutable().assign(createConstValue(gridDimY.value()));
if (gridDimZ.has_value())
gpuLaunch.getGridSizeZMutable().assign(createConstValue(gridDimZ.value()));
if (blockDimX.has_value())
gpuLaunch.getBlockSizeXMutable().assign(
createConstValue(blockDimX.value()));
if (blockDimY.has_value())
gpuLaunch.getBlockSizeYMutable().assign(
createConstValue(blockDimY.value()));
if (blockDimZ.has_value())
gpuLaunch.getBlockSizeZMutable().assign(
createConstValue(blockDimZ.value()));
return success();
}

DiagnosedSilenceableFailure
transform::MapNestedForeachThreadToGpuThreads::applyToOne(
Operation *target, SmallVectorImpl<Operation *> &results,
transform::TransformState &state) {

gpu::LaunchOp gpuLaunch = dyn_cast<gpu::LaunchOp>(target);
if (!gpuLaunch) {
target->emitError("Given target is not gpu.launch");
return DiagnosedSilenceableFailure::definiteFailure();
}

SmallVector<int64_t> blockDim = extractFromI64ArrayAttr(getBlockDim());
blockDim.resize(/*size=*/3, /*value=*/1);
SimpleRewriter rewriter(getContext());
rewriter.setInsertionPoint(target);
auto walkResult = mlir::linalg::rewriteMapNestedForeachThreadToGpuThreads(
rewriter, target, blockDim, getSyncAfterDistribute());
if (walkResult.wasInterrupted())
return DiagnosedSilenceableFailure(reportUnknownTransformError(target));

LogicalResult result =
alterGpuLaunch(rewriter, gpuLaunch, llvm::None, llvm::None, llvm::None,
blockDim[0], blockDim[1], blockDim[2]);
if (failed(result))
return DiagnosedSilenceableFailure::definiteFailure();

results.assign({target});
return DiagnosedSilenceableFailure(success());
}

//===----------------------------------------------------------------------===//
// MapNestedForeachThreadToGpuBlocks
//===----------------------------------------------------------------------===//

LogicalResult mlir::linalg::rewriteTopLevelForeachThreadToGpuBlocks(
RewriterBase &rewriter, scf::ForeachThreadOp foreachThreadOp,
function_ref<void(RewriterBase &, scf::ForeachThreadOp,
SmallVector<Value> &)>
blockIdGenerator,
SmallVector<int64_t> &gridDims) {
if (foreachThreadOp.getNumResults() > 0)
return foreachThreadOp->emitError(
"only bufferized scf.foreach_thread lowers to gpu.block_id");
if (foreachThreadOp.getNumThreads().size() > 3)
return foreachThreadOp->emitError(
"scf.foreach_thread with rank > 3 does not lower to gpu.block_id");

// Step 0. Outline the compute workload region and set up the workload
// operands.
auto potentialGridDim = foreachThreadOp.getPermutedNumThreads(rewriter);
if (failed(potentialGridDim) ||
llvm::any_of(*potentialGridDim, [](OpFoldResult ofr) {
return !getConstantIntValue(ofr).has_value();
}))
return foreachThreadOp->emitError("unsupported dynamic gridDim");

for (OpFoldResult ofr : *potentialGridDim)
gridDims.push_back(getConstantIntValue(ofr).value());

SmallVector<Value> blockOps;
blockIdGenerator(rewriter, foreachThreadOp, blockOps);

// Step 1. Move the body of foreachThreadOp.
// Erase the terminator first, it will not be used since we are on buffers.
rewriter.eraseOp(foreachThreadOp.getTerminator());
Block *targetBlock = foreachThreadOp->getBlock();
Block::iterator insertionPoint = Block::iterator(foreachThreadOp);
Block &sourceBlock = foreachThreadOp.getRegion().front();
targetBlock->getOperations().splice(insertionPoint,
sourceBlock.getOperations());

// Step 2. RAUW thread indices to thread ops.
SmallVector<Value> threadIndices =
*foreachThreadOp.getPermutedThreadIndices();
assert(blockOps.size() == 3 && "3 block id ops are required");
for (auto it : llvm::zip(threadIndices, blockOps)) {
Value val = std::get<0>(it);
if (!val)
continue;
for (Operation *user : llvm::make_early_inc_range(val.getUsers())) {
rewriter.updateRootInPlace(
user, [&]() { user->replaceUsesOfWith(val, std::get<1>(it)); });
}
}

// Step 3. Erase old op.
rewriter.eraseOp(foreachThreadOp);

return success();
}

FailureOr<scf::ForeachThreadOp>
mlir::linalg::findTopLevelForeachThreadOp(Operation *target) {
scf::ForeachThreadOp topLevelForeachThreadOp;
auto walkResult = target->walk([&](scf::ForeachThreadOp foreachThreadOp) {
if (foreachThreadOp->getParentOfType<scf::ForeachThreadOp>())
return WalkResult::advance();
if (topLevelForeachThreadOp)
// TODO Handle multiple foreach if there is no dependences between them
return WalkResult::interrupt();
topLevelForeachThreadOp = foreachThreadOp;
return WalkResult::advance();
});

if (walkResult.wasInterrupted())
return target->emitError(
"could not find a unique topLevel scf.foreach_thread");

return topLevelForeachThreadOp;
}
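
The search above accepts a target only if it contains exactly one `scf.foreach_thread` that is not nested inside another one. The sketch below shows the same idea on a toy tree. `ToyOp`, `collectTopLevel`, and `findUniqueTopLevel` are hypothetical names, and unlike the function above the sketch also returns null when no match exists at all.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical minimal operation tree: a name plus nested child operations.
struct ToyOp {
  std::string name;
  std::vector<ToyOp> children;
};

struct TopLevelResult {
  const ToyOp *first = nullptr; // First top-level match, if any.
  int count = 0;                // Number of top-level matches seen.
};

// Visits the tree, counting ops of the given kind that have no ancestor of
// the same kind. Matches are not descended into, so nested ones are ignored.
void collectTopLevel(const ToyOp &root, const std::string &kind,
                     TopLevelResult &result) {
  for (const ToyOp &child : root.children) {
    if (child.name == kind) {
      if (result.count == 0)
        result.first = &child;
      ++result.count;
      continue;
    }
    collectTopLevel(child, kind, result);
  }
}

// Returns the unique top-level op of the given kind, or nullptr if there is
// none or more than one.
const ToyOp *findUniqueTopLevel(const ToyOp &root, const std::string &kind) {
  TopLevelResult result;
  collectTopLevel(root, kind, result);
  return result.count == 1 ? result.first : nullptr;
}
```

A loop nested inside another loop does not make the search ambiguous; two sibling loops do, which corresponds to the "could not find a unique topLevel scf.foreach_thread" error.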

/// Create gpuLaunchOp with given kernel configurations
static FailureOr<gpu::LaunchOp>
createGpuLaunch(RewriterBase &rewriter, Location loc,
Optional<int64_t> gridDimX = llvm::None,
Optional<int64_t> gridDimY = llvm::None,
Optional<int64_t> gridDimZ = llvm::None,
Optional<int64_t> blockDimX = llvm::None,
Optional<int64_t> blockDimY = llvm::None,
Optional<int64_t> blockDimZ = llvm::None) {
if (failed(checkGpuLimits(gridDimX, gridDimY, gridDimZ, blockDimX, blockDimY,
blockDimZ)))
return failure();
auto createConstant = [&](int dim) {
return rewriter.create<arith::ConstantIndexOp>(loc, dim);
};
Value one = createConstant(1);
Value gridSizeX =
gridDimX.has_value() ? createConstant(gridDimX.value()) : one;
Value gridSizeY =
gridDimY.has_value() ? createConstant(gridDimY.value()) : one;
Value gridSizeZ =
gridDimZ.has_value() ? createConstant(gridDimZ.value()) : one;
Value blockSizeX =
blockDimX.has_value() ? createConstant(blockDimX.value()) : one;
Value blockSizeY =
blockDimY.has_value() ? createConstant(blockDimY.value()) : one;
Value blockSizeZ =
blockDimZ.has_value() ? createConstant(blockDimZ.value()) : one;
auto launchOp = rewriter.create<gpu::LaunchOp>(
loc, gridSizeX, gridSizeY, gridSizeZ, blockSizeX, blockSizeY, blockSizeZ);
rewriter.setInsertionPointToEnd(&launchOp.getBody().front());
rewriter.create<gpu::TerminatorOp>(loc);
return launchOp;
}

/// This is an helper that is only used in
/// rewriteTopLevelForeachThreadToGpuBlocks. It generates GPU dialects block_id
static void generateGpuBlockIds(RewriterBase &rewriter,
scf::ForeachThreadOp foreachOp,
SmallVector<Value> &blockOps) {
Location loc = foreachOp->getLoc();
OpBuilder::InsertionGuard guard(rewriter);
rewriter.setInsertionPoint(foreachOp);
IndexType indexType = rewriter.getIndexType();
SmallVector<gpu::Dimension> gpuDims{gpu::Dimension::x, gpu::Dimension::y,
gpu::Dimension::z};
for (int64_t idx : llvm::seq<int64_t>(0, gpuDims.size())) {
blockOps.push_back(
rewriter.create<gpu::BlockIdOp>(loc, indexType, gpuDims[idx]));
}
}

DiagnosedSilenceableFailure
transform::MapNestedForeachThreadToGpuBlocks::applyToOne(
Operation *target, SmallVectorImpl<Operation *> &results,
transform::TransformState &state) {
gpu::LaunchOp gpuLaunch = dyn_cast<gpu::LaunchOp>(target);
SimpleRewriter rewriter(getContext());

if (!getGenerateGpuLaunch() && !gpuLaunch) {
target->emitError("Given target is not gpu.launch, set "
"`generate_gpu_launch` attribute");
return DiagnosedSilenceableFailure::definiteFailure();
}

auto res = mlir::linalg::findTopLevelForeachThreadOp(target);
if (failed(res))
return DiagnosedSilenceableFailure(reportUnknownTransformError(target));

scf::ForeachThreadOp topLevelForeachThreadOp = *res;
OpBuilder::InsertionGuard guard(rewriter);
rewriter.setInsertionPoint(topLevelForeachThreadOp);

// Generate gpu launch here and move the foreach_thread inside
if (getGenerateGpuLaunch()) {
FailureOr<gpu::LaunchOp> maybeGpuLaunch =
createGpuLaunch(rewriter, target->getLoc());
if (failed(maybeGpuLaunch))
return DiagnosedSilenceableFailure(reportUnknownTransformError(target));
gpuLaunch = *maybeGpuLaunch;
rewriter.setInsertionPointToStart(&gpuLaunch.getBody().front());
Operation *newForeachThreadOp = rewriter.clone(*topLevelForeachThreadOp);
rewriter.eraseOp(topLevelForeachThreadOp);
topLevelForeachThreadOp =
dyn_cast<scf::ForeachThreadOp>(newForeachThreadOp);
}

SmallVector<int64_t> gridDim = extractFromI64ArrayAttr(getGridDim());
if (failed(mlir::linalg::rewriteTopLevelForeachThreadToGpuBlocks(
rewriter, topLevelForeachThreadOp, generateGpuBlockIds, gridDim)))
return DiagnosedSilenceableFailure(reportUnknownTransformError(target));

if (failed(alterGpuLaunch(rewriter, gpuLaunch, gridDim[0], gridDim[1],
gridDim[2])))
return DiagnosedSilenceableFailure::definiteFailure();

results.assign({gpuLaunch});
return DiagnosedSilenceableFailure(success());
}

//===----------------------------------------------------------------------===//
// TileToForeachThreadOp
//===----------------------------------------------------------------------===//

DiagnosedSilenceableFailure transform::tileToForeachThreadOpImpl(
RewriterBase &rewriter, transform::TransformState &state,
TransformOpInterface transformOp, ArrayRef<Operation *> targets,
ArrayRef<OpFoldResult> mixedNumThreads,
ArrayRef<OpFoldResult> mixedTileSizes, Optional<ArrayAttr> threadDimMapping,
[lines not shown]

mlir/test/Dialect/GPU/transform-gpu.mlir

This file was moved from mlir/test/Dialect/Linalg/transform-gpu.mlir.

[29 lines not shown]
// CHECK: memref.load %[[ARGY]][%[[BLKX]], %[[BLKY]]]
return %y : !type
}

transform.with_pdl_patterns {
^bb0(%arg0: !pdl.operation):
transform.sequence %arg0 failures(propagate) {
^bb1(%arg1: !pdl.operation):
%funcop = transform.structured.match ops{["gpu.launch"]} in %arg0
- transform.structured.map_nested_foreach_thread_to_gpu_blocks %funcop { blockDim = [12, 9, 1]}
+ transform.gpu.map_foreach_to_blocks %funcop { blockDim = [12, 9, 1]}
}
}

// -----

!type = memref<2 x 32 x f32>
!type1d = memref<32 x f32>

[40 lines not shown]
// CHECK: gpu.barrier
return %y : !type
}

transform.with_pdl_patterns {
^bb0(%arg0: !pdl.operation):
transform.sequence %arg0 failures(propagate) {
^bb1(%arg1: !pdl.operation):
%funcop = transform.structured.match ops{["gpu.launch"]} in %arg0
- transform.structured.map_nested_foreach_thread_to_gpu_threads %funcop { blockDim = [12, 9, 1] }
+ transform.gpu.map_nested_foreach_to_threads %funcop { blockDim = [12, 9, 1] }
}
}

// -----

!type4d = memref<32x64x4x32xf32>

// CHECK-LABEL: func.func @saxpy4d(
[25 lines not shown]
// CHECK: memref.load %[[ARGY]][%[[BLKX]], %[[BLKY]], %[[TIDY]], %[[TIDX]]]
return %y : !type4d
}

transform.with_pdl_patterns {
^bb0(%arg0: !pdl.operation):
transform.sequence %arg0 failures(propagate) {
^bb1(%arg1: !pdl.operation):
%funcop = transform.structured.match ops{["func.func"]} in %arg0
- %gpuLaunch = transform.structured.map_nested_foreach_thread_to_gpu_blocks %funcop { generate_gpu_launch }
+ %gpuLaunch = transform.gpu.map_foreach_to_blocks %funcop { generate_gpu_launch }
- transform.structured.map_nested_foreach_thread_to_gpu_threads %gpuLaunch { blockDim = [32, 4, 1] }
+ transform.gpu.map_nested_foreach_to_threads %gpuLaunch { blockDim = [32, 4, 1] }
}
}

// -----

!type = memref<2 x 32 x f32>
!type1d = memref<32 x f32>

[19 lines not shown]
// CHECK: return
return %y : !type
}

transform.with_pdl_patterns {
^bb0(%arg0: !pdl.operation):
transform.sequence %arg0 failures(propagate) {
^bb1(%arg1: !pdl.operation):
%funcop = transform.structured.match ops{["gpu.launch"]} in %arg0
- transform.structured.map_nested_foreach_thread_to_gpu_threads %funcop { blockDim = [12, 9, 1], syncAfterDistribute = false }
+ transform.gpu.map_nested_foreach_to_threads %funcop { blockDim = [12, 9, 1], syncAfterDistribute = false }
}
}

mlir/test/Dialect/Linalg/transform-gpu.mlir

This file was moved to mlir/test/Dialect/GPU/transform-gpu.mlir.
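
Tests or pipelines outside the tree that still use the old op names would need the same rename that the moved test file received. A minimal sketch of that rename, assuming a plain textual substitution is sufficient; the helper `migrate_op_names` and the `RENAMES` table are hypothetical and not part of this revision, and only the two op-name pairs come from the summary:

```python
import re

# Old op names (previously under transform.structured) mapped to the
# new transform.gpu names listed in this revision's summary.
RENAMES = {
    "transform.structured.map_nested_foreach_thread_to_gpu_blocks":
        "transform.gpu.map_foreach_to_blocks",
    "transform.structured.map_nested_foreach_thread_to_gpu_threads":
        "transform.gpu.map_nested_foreach_to_threads",
}

# One alternation over the escaped old names; neither name is a prefix
# of the other, so the order of alternatives does not matter here.
_PATTERN = re.compile("|".join(re.escape(name) for name in RENAMES))


def migrate_op_names(text: str) -> str:
    """Return `text` with every old op name replaced by its new name.

    Hypothetical helper: attributes such as blockDim are left untouched,
    which matches the summary's statement that no functionality changed.
    """
    return _PATTERN.sub(lambda match: RENAMES[match.group(0)], text)
```

Applied to the transform sequences in the test file, this changes only the op name and leaves operands and attributes such as `blockDim` and `syncAfterDistribute` as they were.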

utils/bazel/llvm-project-overlay/mlir/BUILD.bazel

[3,756 lines not shown]
] + if_cuda_available([		] + if_cuda_available([
":NVVMToLLVMIRTranslation",		":NVVMToLLVMIRTranslation",
"//llvm:NVPTXCodeGen",		"//llvm:NVPTXCodeGen",
"@cuda//:cuda_headers",		"@cuda//:cuda_headers",
"@cuda//:libcuda",		"@cuda//:libcuda",
]),		]),
)		)

td_library(		td_library(
		name = "GPUTransformOpsTdFiles",
		srcs = [
		"include/mlir/Dialect/GPU/TransformOps/GPUTransformOps.td",
		],
		includes = ["include"],
		deps = [
		":PDLDialectTdFiles",
		":TransformDialectTdFiles",
		],
		)

		gentbl_cc_library(
		name = "GPUTransformOpsIncGen",
		strip_include_prefix = "include",
		tbl_outs = [
		(
		["-gen-op-decls"],
		"include/mlir/Dialect/GPU/TransformOps/GPUTransformOps.h.inc",
		),
		(
		["-gen-op-defs"],
		"include/mlir/Dialect/GPU/TransformOps/GPUTransformOps.cpp.inc",
		),
		],
		tblgen = ":mlir-tblgen",
		td_file = "include/mlir/Dialect/GPU/TransformOps/GPUTransformOps.td",
		deps = [
		":GPUTransformOpsTdFiles",
		],
		)

		cc_library(
		name = "GPUTransformOps",
		srcs = [
		"lib/Dialect/GPU/TransformOps/GPUTransformOps.cpp",
		],
		hdrs = [
		"include/mlir/Dialect/GPU/TransformOps/GPUTransformOps.h",
		],
		includes = ["include"],
		deps = [
		":ArithDialect",
		":AsmParser",
		":ControlFlowDialect",
		":GPUDialect",
		":GPUTransformOpsIncGen",
		":GPUTransforms",
		":IR",
		":PDLDialect",
		":Parser",
		":SCFDialect",
		":SideEffectInterfaces",
		":TransformDialect",
		":TransformUtils",
		"//llvm:Support",
		],
		)

		td_library(
name = "LLVMOpsTdFiles",		name = "LLVMOpsTdFiles",
srcs = [		srcs = [
"include/mlir/Dialect/LLVMIR/LLVMIntrinsicOps.td",		"include/mlir/Dialect/LLVMIR/LLVMIntrinsicOps.td",
"include/mlir/Dialect/LLVMIR/LLVMOpBase.td",		"include/mlir/Dialect/LLVMIR/LLVMOpBase.td",
"include/mlir/Dialect/LLVMIR/LLVMOps.td",		"include/mlir/Dialect/LLVMIR/LLVMOps.td",
"include/mlir/Dialect/LLVMIR/LLVMOpsInterfaces.td",		"include/mlir/Dialect/LLVMIR/LLVMOpsInterfaces.td",
],		],
includes = ["include"],		includes = ["include"],
[2,623 lines not shown]
deps = [		deps = [
":FuncTransformsPassIncGen",		":FuncTransformsPassIncGen",
":GPUDialect",		":GPUDialect",
":GPUPassIncGen",		":GPUPassIncGen",
":GPUToGPURuntimeTransforms",		":GPUToGPURuntimeTransforms",
":GPUToNVVMTransforms",		":GPUToNVVMTransforms",
":GPUToROCDLTransforms",		":GPUToROCDLTransforms",
":GPUToSPIRV",		":GPUToSPIRV",
":GPUToVulkanTransforms",		":GPUToVulkanTransforms",
		":GPUTransformOps",
":GPUTransforms",		":GPUTransforms",
":IR",		":IR",
":LLVMDialect",		":LLVMDialect",
":LLVMIRTransforms",		":LLVMIRTransforms",
":LLVMPassIncGen",		":LLVMPassIncGen",
":LinalgDialect",		":LinalgDialect",
":LinalgPassIncGen",		":LinalgPassIncGen",
":LinalgToLLVM",		":LinalgToLLVM",
[3,265 lines not shown]


Diff 464948
