The current setup of the GPU dialect is to model both the host and
device side codegen. For cases (like IREE) the host side modeling
might not directly fit its use case, but device-side codegen is still
valuable. First step in accessing just the device-side functionality
of the GPU dialect is to allow just creating a gpu.func operation from
a gpu.launch operation. In addition this change also "inlines"
operations into the gpu.func op at time of creation instead of this
being a later step.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Unit Tests
Time | Test | |
---|---|---|
70 ms | MLIR.Dialect/GPU::Unknown Unit Message ("") |
Event Timeline
mlir/lib/Dialect/GPU/Transforms/KernelOutlining.cpp | ||
---|---|---|
92 | Please don't use "inline" for other aspect that the inliner. What about sink? | |
95 | Nit: don't use auto when it does not improve the readability (line 93 below is explicit for instance) | |
97 | This whole sinking transformation does not seem safe in general: this should check *legality* rather than "benefit". |
Please check the comments around the code you are modifying/moving, some of them no longer describe what the code does after your changes.
mlir/lib/Dialect/GPU/Transforms/KernelOutlining.cpp | ||
---|---|---|
34–35 | Please update the comment to describe the new API | |
97 |
The function just seems misnamed, should be something like shouldSink because it mixes validity and benefit. In practice, it only returns true for constant and dim operations that don't have side effects. | |
140 | Will this work for blocks whose dominance relation is inverse of their textual order? E.g. ^entry: br ^bb2: ^bb1: "use"(%0) : (index) -> () return ^bb2: %0 = "def"() : () -> (index) br ^bb1 | |
141 | This no longer removes the arguments, but rather updates the map. | |
153 | Will this update the users of the inlinedOps? I don't see the map updated anywhere. |
mlir/lib/Dialect/GPU/Transforms/KernelOutlining.cpp | ||
---|---|---|
97 |
Well, it should check both. You do not want to move all legal operation either :)
This has purely historical reasons. Not long ago, the gpu.launch was closed from above, so this transformation was done when moving to function form. I have a separate pass for this in a local client, which I can send out next week. It just needs tests. It was implemented as a "post transformation" to the outlining and I would prefer if we do not mix it into the outlining transformation itself. When written separately, the transformations are trivial. |
mlir/lib/Dialect/GPU/Transforms/KernelOutlining.cpp | ||
---|---|---|
97 |
Pre-outlining seems easier to manage because region vs inter-procedural (and also can be kept a function pass).
Seems like we're in agreement :) |
Addressing comments and changing the way cloning is of the region of
the gpu.launch operation is done.
mlir/lib/Dialect/GPU/Transforms/KernelOutlining.cpp | ||
---|---|---|
92 | Changed to "sink" and updated all variables names. | |
97 | A pre-pass is fine, but I think it would be better to leave it here. Eventually, it would be good if all transformations can be expressed as a pattern match and rewrite. This "outlining" is essentially converting a gpu.launchOp to a gpu.launchFuncOp. If you need to have a separate pass to sink the instructions, then it breaks the ability of going from loops -> GPU -> NVVM/SPIR-V. I am not saying anybody does this today (not doing this in IREE), but in general it seems like it would be beneficial to have transformations as patterns, and passes as just a light-weight wrapper around patterns. Re: being able to keep it as a function pass, is related to where the gpu.module is created. As set up right now it is put outside of the function that the gpu.launch operation lives in. Thats a a very specific choice and would be very useful to allow "clients" of the outlining to decide where to put the gpu.module. | |
140 | Thanks for pointing this out. I updated the method to use cloneInto which handles this case (see comment on changes to cloneInto. I can make the change here if you think that is reasonable) | |
153 | It is done within the clone(map) operation. The results of the operation are added to the map as well. |
mlir/lib/Dialect/GPU/Transforms/KernelOutlining.cpp | ||
---|---|---|
145 | We are only using the "set" part here right? Just use a set data type? |
mlir/lib/Dialect/GPU/Transforms/KernelOutlining.cpp | ||
---|---|---|
97 |
I don't understand what you mean, can you elaborate?
This is mixing an optimization within an unrelated transformation: this just does not belong here IMO.
I don't know what you mean or how it answer the point about the function pass right now. |
mlir/lib/Dialect/GPU/Transforms/KernelOutlining.cpp | ||
---|---|---|
97 | Re : separate pass to sink instructions. The Dialect conversion framework is designed to go from A -> B -> C. If I want to target SPIR-V/NVVM from Linalg dialect vial Loop dialect and GPU dialects (i.e. Linalg -> Loops -> GPU -> SPIRV/NVVM), I can add all the patterns for the conversion into the dialect conversion framework. Currently Loops to GPU dialect is not exposed as a conversion pattern. GPU to SPIRV is. By adding extra steps as a "pre-condition" will limit the ability to the entire conversion being done using the dialect conversion framework (which is what it is built for). You could add the "sinking" as a canonicalization pattern, but it seems to me this sinking is useful only when the gpu.launch region is outlined to create a gpu.func operation. So doing the sinking during the conversion makes sense to me. Re: fusion pass vs module pass The current setup of the gpu.launch to gpu.launch_func conversion creates a gpu.module that is inserted just after the function the gpu.launch is in. This makes it a module pass, and this behavior is only relevant for the CUDA/NVVM side of things. For IREE, we are only interested in the device side for now. So we can make this a function pass if we can control where the gpu.module is inserted. See this dependent PR in IREE that uses this change and makes the conversion of gpu.launch to gpu.func as a function pass. |
mlir/lib/Dialect/GPU/Transforms/KernelOutlining.cpp | ||
---|---|---|
97 | @mehdi_amini : Update on the function pass vs module pass. You were right that the outlining can only be a module pass since the gpu.module also has a symbol so it needs to be added to a module (or an op with symbol table). So I was wrong about that. I update the PR shown above to be module pass as well, but FYI there was no assert when i did it as a function pass. Just filling in some details about discussion offline. It is true that the sinking could be done as a prepass. If so then it is a separate "pre-processing" pass. It is unclear if sinking can be expressed as a pattern. |
mlir/lib/Dialect/GPU/Transforms/KernelOutlining.cpp | ||
---|---|---|
97 | (typed the following this morning before you last comment, but didn't click send)
Yes, but if you just say this, you can shoehorn anything in there: you could use this mental model to go from Swift SIL to X86 assembly in a single "legalize()" call, I don't think this is a good use of the framework.
I disagree that this is what it is built for. I think this is a misconception of what the framework is intended to solve. If we take for example the HLO -> SPIR-V pipeline, we can likely identify logical stages like HLO->Loops, Loops->GPU Kernel, and GPU Kernel -> SPIRV. These stages are fully disjoint as far as I can tell, and there is no immediate benefit to combine them in a single lowering. On the opposite: if these stages are well separated, this provides the opportunity for passes to run on each intermediate level of abstraction (including some generic things like canonicalization or CSE), and it allows also more reusable blocks (Loops->GPU Kernel can be reused even when you don't come from HLO). It also forces testing at every level and help compiler engineers keeping a mental model where we can reason about this stages and how they compose independently. |
Updating patch to separate the sinking of instructions into launch op
as a separate function.
mlir/lib/Dialect/GPU/Transforms/KernelOutlining.cpp | ||
---|---|---|
97 | Thanks @mehdi_amini for that overview. I think what you say makes sense and is a good thumb rule to use (probably good to add it somewhere in rationale) Going back to the change at hand. I modified the patch to expose the "sinking" transformation as a separate utility function exposed by the GPU dialect. PTAL, but to me this seems more complex. If this is along the lines of what is the recommendation here, I can work with it for my use case. |
mlir/lib/Dialect/GPU/Transforms/KernelOutlining.cpp | ||
---|---|---|
196 | Note: this is still not decoupled from this pass right now (i.e. not tested in isolation, etc.): we still have "outlining" and "sinking" part of the same pass, can't they be separated? |
I don't have further objections other than a bunch of nits. This patch intends to expose _functions_ and can land as is. Refactoring those functions into separate (test) passes is okay for a follow-up IMO.
mlir/include/mlir/Dialect/GPU/Passes.h | ||
---|---|---|
21 | struct, otherwise you'll break windows builds | |
55 | Bikeshed nit: sinkOperationsIntoLaunchOp. "Intructions" aren't a thing in MLIR. | |
mlir/lib/Dialect/GPU/Transforms/KernelOutlining.cpp | ||
58–59 | This comment looks outdated | |
63 | Bikeshed nit: operands confused me into thinking it referred to the _existing_ launchOp operands, not "values that might become operands if sinking is beneficial". | |
82 | How about iterating over uses rather than users? for (auto use : result.value().getUses()) { if (use.getUser().getParentOfType<gpu::LaunchOp>() == launchOp)) use.getOperand().set(replacement); } | |
89 | Can this happen in a valid IR? If not, I would rather assert. Otherwise, please drop trivial braces | |
194 | Sinking already reports an error, no need to add another one IMO. | |
204 | Nit: Please drop trivial braces |
mlir/include/mlir/Dialect/GPU/Passes.h | ||
---|---|---|
21 | THanks! Interesting that the build bot passed. | |
55 | Good point. I have conflated the two mentally cause of that. | |
mlir/lib/Dialect/GPU/Transforms/KernelOutlining.cpp | ||
89 | I think so, but I am not sure. Will leave it as an error, and remove braces | |
196 | They are separate functions. I have no visibility into the clients of the pass. So if any user of the pass is relying on sinking happening then removing the sinking would "potentially" break. One could argue that then it is incorrect usage since the gpu.launch_func op gets updated accordingly, but at this point I would rather keep this change as an NFC. |
mlir/lib/Dialect/GPU/Transforms/KernelOutlining.cpp | ||
---|---|---|
196 | Keeping it NFC is a very good point! (what would break here is an optimization and not correctness right? So we can still do it in the absolute?) |
mlir/include/mlir/Dialect/GPU/Passes.h | ||
---|---|---|
46 | Nit: this is the only *pass* exposed in a header called Passes.h. |
mlir/lib/Dialect/GPU/Transforms/KernelOutlining.cpp | ||
---|---|---|
196 | Yes, it would indeed be an optimization thing and not a correctness thing. |
struct, otherwise you'll break windows builds