Page MenuHomePhabricator

[mlir][transform] Create GPU transform dialect

Authored by guraypp on Sep 28 2022, 3:42 AM.



This revision adds GPU transform dialect. It also introduce a prefix such as "transform.gpu" for all ops related to this dialect.

MLIR already had two GPU transform op in linalg. This revision moves these ops into GPUTransformOps. The Ops are as follows:

transform.structured.map_nested_foreach_thread_to_gpu_blocks -> transform.gpu.map_foreach_to_blocks
This op selects the outermost (toplevel) foreach_thread and parallelize across GPU blocks. It can also generate gpu_launch.

transform.structured.map_nested_foreach_thread_to_gpu_threads -> transform.gpu.map_nested_foreach_to_threads
This op parallelizes nested foreach_thread that are inside gpu_launch across GPU threads.

It doesn't add new functionality, but there are some minor refactoring of the code.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
ftynse added inline comments.Sep 28 2022, 8:14 AM

Nit: end comments with a full stop. Here and below.


Nit: it may be a good idea to create an InsertionGuard at the entry of the function so the rewriter is reset back to where it was on return.


Nit FYI: you can also do return op->emitError(), diagnostics are convertible to failure().


Please expand auto unless the type is clear from local context (there's a cast on the RHS) or difficult to spell (iterators, lambdas).


Nit: this is a "non-trivial" conditional because of a multi-line condition and therefore needs braces around the body.


Nit: we have C++17, for (auto [threadIdx, blockOp] : llvm::zip(...)) should work here.


There's no point in jumping through the updateRootInPlace hoops if you are calling splice above. This function cannot be used with dialect conversion anyway.


Note that this will advance over the "foreach(target-op(foreach()))" construct where the inner foreach is top-level within the given target but has _some_ other foreach ancestor. This should probably check that the foreach ancestor of another foreach has target as ancestor.


IMO, this should be silenceable failure. We haven't irreversibly modified the IR yet.


Please expand auto.


Can we rather return a DiagnosedSilenceableFailure from findTopLevelForeachThreadOp instead of LogicalResult? Unknown errors are almost useless.


Same as above.


Use cast if the result is never checked for being non-null.


Nit: there is no gpu.thread operation, did you mean thread_id?


Nit: creating IR inside another IR-creation call is highly discouraged. This specifically is not problematic because there will be no ordering issue given there's only one nested operation, but some later modification may add a second one leading to unspecified order in some compilers.


Same comment as above about RAUW and splice.


Please add the newline.

This revision now requires changes to proceed.Sep 28 2022, 8:14 AM
guraypp updated this revision to Diff 463853.Sep 29 2022, 5:35 AM

Addressed ftynse comments

@ftynse thank you so much for reviewing this one! I addressed your comments.

guraypp updated this revision to Diff 463857.Sep 29 2022, 5:58 AM
guraypp marked 29 inline comments as done.

Pass SmallVectorImpl instead SmallVector

This seems to include some irrelevant changes.


This change looks unrelated. Rebase went wrong?


Nit: /// here too, // inside functions only.


Nit: this can be removed now.


Looks unrelated.


Nit: DiagnosedSIlenceableFailure(diag.checkAndReport()) is shorter and will not drop the notes from the diagnostic. We can also add some makeFailureDefinite() that turns silenceable failure into a definite one and propagates success.


On a second thought, why is this a definite failure? We usually use definite failures when the transformation could have modified the IR in an irrecoverable way. Definite failures are bubbled up really fast and stop the entire transformation process. This looks more like failed precondition and we can attempt something else. IMO, this should be just simple propagation`if (!diag.succeeded()) return diag;`.

I suppose that you have a longer transform combo in mind, something like "create gpu launch, then do something with it", but that part can also propagate silenceable failures. In the transform dialect context, if nothing explicitly suppresses the error (like transform.alternatives or sequence failures(suppress)), it will ultimately be reported. Here, by emitting a definite failure because of a precondition, you make it impossible to build an alternatives that tries to map to GPU and fallbacks to staying on CPU otherwise, for example.




This doesn't seem fixed.



DiagnosedSilenceableFailure diag = emitSilenceableError() << ""message";
diag.attachNote(target->getLoc()) << "when applied to this payload op";
return diag;

Please add tests for at least some of the user-visible error messages (those coming from emitError / silenceable diagnostic).

guraypp updated this revision to Diff 463862.Sep 29 2022, 6:12 AM
guraypp marked 2 inline comments as done.


guraypp updated this revision to Diff 463897.Sep 29 2022, 7:48 AM

update more definitive errors into silenceable failure

ftynse added inline comments.Sep 29 2022, 8:02 AM

Leftover debug

guraypp added inline comments.Sep 29 2022, 9:08 AM

I removed PDLTypes. It looks like I need PDL not the PDLTypes because I add this as dependent dialect.



I am not sure I understood this one. What else I can use instead of updateRootInPlace?


Good point! I noticed that it has many missing checks and this is one of them. For the case of " gpu::LaunchOp(foreach(target-op(foreach())))", it could be even gpu::LaunchOp before. It does not check that as well.

I will handle these checks in another PR with some tests if it's okay?


Could you please elaborate this comment?

guraypp updated this revision to Diff 463932.Sep 29 2022, 9:13 AM

remove leftover debug

ftynse added inline comments.Sep 29 2022, 9:51 AM

Just ues replaceAllUsesWith directly, without wrapping into updateRootInPlace.




If you do something like b.create<MulFOp>(loc, b.create<AddFOp>(...), b.create<AddFOp>(...)), the order in which the "add" operations will be created is unspecified because the order of function argument evaluation is unspecified and does vary between compilers/platforms. Meaning we cannot reliably test the created IR since we don't know the order of operations in it. This is not a problem if only _one_ of the operands creates additional IR, which is the case here. But it makes it easier to accidentally add a second such call during refactoring, especially if it is hidden in a function call. So we decided at some point that it is safer to _never_ call "create" within the arguments of another "create", not even once. Avoiding such nested create also makes the C++ code slightly more similar to the IR produced: there's one statement per produced operation.

guraypp updated this revision to Diff 464171.Sep 30 2022, 1:04 AM

Remove updateRootInPlace.
Change nested IR generation.

guraypp marked 18 inline comments as done.Sep 30 2022, 1:20 AM
guraypp added inline comments.

I return everything as a "silenceable failure" before modifying the IR. Returning "definitive failure" is useless due to several precondition checks. The user can specify "sequence failures(propagate)" if they want failures. Thanks for explaining this mechanism.

My misunderstanding is caused because "silenceable failure" does not show up with the "applyToOne" as I demonstrated at

Returning "silent failure" has the drawback that the results still need to be appropriately assigned. If results are not set, this time transform dialect itself fails because the number of expected results are not set.


Thanks for the explanation, now I understand what you mean. Yes, the order of generating arguments is undefined. I did an experiment in the link below, gcc and clang on x86 generates the order in the different way. I put here for in case someone wonders in the future :)

ftynse accepted this revision.Sep 30 2022, 4:22 AM

You probably need to rebase on top of the error propagation fix.


The issues with failure propagation should be fixed after lands.

For results, we can add/improve a helper that associates all transform op results (getNumResults) with an empty list of payload iR handles.

This revision is now accepted and ready to land.Sep 30 2022, 4:22 AM
guraypp updated this revision to Diff 464613.Oct 3 2022, 12:58 AM
guraypp marked 2 inline comments as done.
guraypp updated this revision to Diff 464622.Oct 3 2022, 1:49 AM

minor fixes wrt error reporting

guraypp updated this revision to Diff 464638.Oct 3 2022, 2:41 AM

fix bazel build

guraypp updated this revision to Diff 464664.Oct 3 2022, 6:28 AM

Made transformOp optional like llvm::Optional<TransformOpInterface> transformOp for mapNestedForeachToThreadsImp function. This function is needed to be used outside of transform dialect. In this case there is no TransformOpInterface.

This revision was landed with ongoing or failed builds.Oct 4 2022, 4:09 AM
This revision was automatically updated to reflect the committed changes.