Page MenuHomePhabricator

[mlir][transform] Create GPU transform dialect
ClosedPublic

Authored by guraypp on Sep 28 2022, 3:42 AM.

Details

Summary

This revision adds GPU transform dialect. It also introduce a prefix such as "transform.gpu" for all ops related to this dialect.

MLIR already had two GPU transform op in linalg. This revision moves these ops into GPUTransformOps. The Ops are as follows:

transform.structured.map_nested_foreach_thread_to_gpu_blocks -> transform.gpu.map_foreach_to_blocks
This op selects the outermost (toplevel) foreach_thread and parallelize across GPU blocks. It can also generate gpu_launch.

transform.structured.map_nested_foreach_thread_to_gpu_threads -> transform.gpu.map_nested_foreach_to_threads
This op parallelizes nested foreach_thread that are inside gpu_launch across GPU threads.

It doesn't add new functionality, but there are some minor refactoring of the code.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
ftynse added inline comments.Sep 28 2022, 8:14 AM
mlir/lib/Dialect/GPU/TransformOps/GPUTransformOps.cpp
63

Nit: end comments with a full stop. Here and below.

87

Nit: it may be a good idea to create an InsertionGuard at the entry of the function so the rewriter is reset back to where it was on return.

103–105

Nit FYI: you can also do return op->emitError(), diagnostics are convertible to failure().

150

Please expand auto unless the type is clear from local context (there's a cast on the RHS) or difficult to spell (iterators, lambdas).

154–155

Nit: this is a "non-trivial" conditional because of a multi-line condition and therefore needs braces around the body.

176

Nit: we have C++17, for (auto [threadIdx, blockOp] : llvm::zip(...)) should work here.

181–182

There's no point in jumping through the updateRootInPlace hoops if you are calling splice above. This function cannot be used with dialect conversion anyway.

196–197

Note that this will advance over the "foreach(target-op(foreach()))" construct where the inner foreach is top-level within the given target but has _some_ other foreach ancestor. This should probably check that the foreach ancestor of another foreach has target as ancestor.

199
212
213
238

IMO, this should be silenceable failure. We haven't irreversibly modified the IR yet.

241

Please expand auto.

243

Can we rather return a DiagnosedSilenceableFailure from findTopLevelForeachThreadOp instead of LogicalResult? Unknown errors are almost useless.

253–254

Same as above.

260

Use cast if the result is never checked for being non-null.

292

Nit: there is no gpu.thread operation, did you mean thread_id?

331

Nit: creating IR inside another IR-creation call is highly discouraged. This specifically is not problematic because there will be no ordering issue given there's only one nested operation, but some later modification may add a second one leading to unspecified order in some compilers.

365

Same comment as above about RAUW and splice.

508

Please add the newline.

This revision now requires changes to proceed.Sep 28 2022, 8:14 AM
guraypp updated this revision to Diff 463853.Sep 29 2022, 5:35 AM

Addressed ftynse comments

@ftynse thank you so much for reviewing this one! I addressed your comments.

guraypp updated this revision to Diff 463857.Sep 29 2022, 5:58 AM
guraypp marked 29 inline comments as done.

Pass SmallVectorImpl instead SmallVector

This seems to include some irrelevant changes.

mlir/include/mlir/Dialect/Linalg/Transforms/Transforms.h
1340–1355

This change looks unrelated. Rebase went wrong?

1342

Nit: /// here too, // inside functions only.

1353

Nit: this can be removed now.

mlir/include/mlir/InitAllDialects.h
117

Looks unrelated.

mlir/lib/Dialect/GPU/TransformOps/GPUTransformOps.cpp
83–84

Nit: DiagnosedSIlenceableFailure(diag.checkAndReport()) is shorter and will not drop the notes from the diagnostic. We can also add some makeFailureDefinite() that turns silenceable failure into a definite one and propagates success.

84

On a second thought, why is this a definite failure? We usually use definite failures when the transformation could have modified the IR in an irrecoverable way. Definite failures are bubbled up really fast and stop the entire transformation process. This looks more like failed precondition and we can attempt something else. IMO, this should be just simple propagation`if (!diag.succeeded()) return diag;`.

I suppose that you have a longer transform combo in mind, something like "create gpu launch, then do something with it", but that part can also propagate silenceable failures. In the transform dialect context, if nothing explicitly suppresses the error (like transform.alternatives or sequence failures(suppress)), it will ultimately be reported. Here, by emitting a definite failure because of a precondition, you make it impossible to build an alternatives that tries to map to GPU and fallbacks to staying on CPU otherwise, for example.

118–119

Ditto.

196–197

This doesn't seem fixed.

223–226

Ditto.

256–259
DiagnosedSilenceableFailure diag = emitSilenceableError() << ""message";
diag.attachNote(target->getLoc()) << "when applied to this payload op";
return diag;

Please add tests for at least some of the user-visible error messages (those coming from emitError / silenceable diagnostic).

guraypp updated this revision to Diff 463862.Sep 29 2022, 6:12 AM
guraypp marked 2 inline comments as done.

rebase

guraypp updated this revision to Diff 463897.Sep 29 2022, 7:48 AM

update more definitive errors into silenceable failure

ftynse added inline comments.Sep 29 2022, 8:02 AM
mlir/lib/Dialect/GPU/TransformOps/GPUTransformOps.cpp
438

Leftover debug

guraypp added inline comments.Sep 29 2022, 9:08 AM
mlir/lib/Dialect/GPU/TransformOps/GPUTransformOps.cpp
15

I removed PDLTypes. It looks like I need PDL not the PDLTypes because I add this as dependent dialect.

declareDependentDialect<pdl::PDLDialect>();

181–182

I am not sure I understood this one. What else I can use instead of updateRootInPlace?

196–197

Good point! I noticed that it has many missing checks and this is one of them. For the case of " gpu::LaunchOp(foreach(target-op(foreach())))", it could be even gpu::LaunchOp before. It does not check that as well.

I will handle these checks in another PR with some tests if it's okay?

331

Could you please elaborate this comment?

guraypp updated this revision to Diff 463932.Sep 29 2022, 9:13 AM

remove leftover debug

ftynse added inline comments.Sep 29 2022, 9:51 AM
mlir/lib/Dialect/GPU/TransformOps/GPUTransformOps.cpp
181–182

Just ues replaceAllUsesWith directly, without wrapping into updateRootInPlace.

196–197

Ok.

331

If you do something like b.create<MulFOp>(loc, b.create<AddFOp>(...), b.create<AddFOp>(...)), the order in which the "add" operations will be created is unspecified because the order of function argument evaluation is unspecified and does vary between compilers/platforms. Meaning we cannot reliably test the created IR since we don't know the order of operations in it. This is not a problem if only _one_ of the operands creates additional IR, which is the case here. But it makes it easier to accidentally add a second such call during refactoring, especially if it is hidden in a function call. So we decided at some point that it is safer to _never_ call "create" within the arguments of another "create", not even once. Avoiding such nested create also makes the C++ code slightly more similar to the IR produced: there's one statement per produced operation.

guraypp updated this revision to Diff 464171.Sep 30 2022, 1:04 AM

Remove updateRootInPlace.
Change nested IR generation.

guraypp marked 18 inline comments as done.Sep 30 2022, 1:20 AM
guraypp added inline comments.
mlir/lib/Dialect/GPU/TransformOps/GPUTransformOps.cpp
84

I return everything as a "silenceable failure" before modifying the IR. Returning "definitive failure" is useless due to several precondition checks. The user can specify "sequence failures(propagate)" if they want failures. Thanks for explaining this mechanism.

My misunderstanding is caused because "silenceable failure" does not show up with the "applyToOne" as I demonstrated at https://reviews.llvm.org/D134886.

Returning "silent failure" has the drawback that the results still need to be appropriately assigned. If results are not set, this time transform dialect itself fails because the number of expected results are not set.

331

Thanks for the explanation, now I understand what you mean. Yes, the order of generating arguments is undefined. I did an experiment in the link below, gcc and clang on x86 generates the order in the different way. I put here for in case someone wonders in the future :)
https://godbolt.org/z/TYf3cqxx5

ftynse accepted this revision.Sep 30 2022, 4:22 AM

You probably need to rebase on top of the error propagation fix.

mlir/lib/Dialect/GPU/TransformOps/GPUTransformOps.cpp
84

The issues with failure propagation should be fixed after https://reviews.llvm.org/D134948 lands.

For results, we can add/improve a helper that associates all transform op results (getNumResults) with an empty list of payload iR handles.

This revision is now accepted and ready to land.Sep 30 2022, 4:22 AM
guraypp updated this revision to Diff 464613.Oct 3 2022, 12:58 AM
guraypp marked 2 inline comments as done.
guraypp updated this revision to Diff 464622.Oct 3 2022, 1:49 AM

minor fixes wrt error reporting

guraypp updated this revision to Diff 464638.Oct 3 2022, 2:41 AM

fix bazel build

guraypp updated this revision to Diff 464664.Oct 3 2022, 6:28 AM

Made transformOp optional like llvm::Optional<TransformOpInterface> transformOp for mapNestedForeachToThreadsImp function. This function is needed to be used outside of transform dialect. In this case there is no TransformOpInterface.

This revision was landed with ongoing or failed builds.Oct 4 2022, 4:09 AM
This revision was automatically updated to reflect the committed changes.