To support cases where a fused producer value is computed within the loop
and yielded, with the yield value being used as the replacement of the fused
op; add callback that provides the caller control towards this.
Details
- Reviewers
- nicolasvasilache - pzread - chelini - hanchung 
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
| mlir/lib/Dialect/SCF/Transforms/TileUsingInterface.cpp | ||
|---|---|---|
| 674–684 | Is it possible that it failed? If so, should we signal something? IF not, can we add an assertion and we won't need one level of nesting. | |
| mlir/test/lib/Interfaces/TilingInterface/TestTilingInterface.cpp | ||
| 272–274 | should we add memref::DimOp to the list? | |
Addressing minor comment that was missed.
| mlir/include/mlir/Dialect/SCF/Transforms/TileUsingInterface.h | ||
|---|---|---|
| 93 | Sorry, that was confusing. Fixed. | |
| mlir/lib/Dialect/SCF/Transforms/TileUsingInterface.cpp | ||
| 674–684 | Dropped the FailureOr from the called method. It always succeeds. | |
| mlir/test/lib/Interfaces/TilingInterface/TestTilingInterface.cpp | ||
| 272–274 | tile and fuse doesnt work with memrefs and also with memref there is no replacement to do. | |
| mlir/lib/Dialect/SCF/Transforms/TileUsingInterface.cpp | ||
|---|---|---|
| 674–684 | do we need to capture the result of yieldTiledValues? It's not used, and I assume that it's only for generating IRs. | |
| mlir/test/lib/Interfaces/TilingInterface/TestTilingInterface.cpp | ||
|---|---|---|
| 36 | should be consistent with other uses (see below) | |
This feels extremely fishy, I do not understand the use case or the value but see significant complexity increase.
Putting a blocker until I can make sense of this.
I dont follow what is fishy here. This is the culmination of months long effort to fix this https://github.com/llvm/llvm-project/issues/57205 (which has been blocking things downstream for as long). The changes here are the same as what is done in lines 394 - 407 that returns the replacements for the tiled consumer. This is now returning the replacements for a fused producer when that the caller determines is worth/valid to return. It is too much state to track if this value is returnable or not from the fused code. So this is left as a caller flag, so its a complete opt-in, and controlled by the caller. I dont see the complexity increase. Happy to discuss offline. This needs to be solved, and it was determined that the previous tile+fuse method was not structured enough to fix the original issue, so it was redone to make this issue solvable...
I'll paste my comment from https://reviews.llvm.org/D138882 here.
Injecting lambda form above allows inversion of control that is convenient very short term and has almost always proven to very quickly turn to technical debt. Given recent offline discussions I had and other parts of the codebase I have seen, I am going to more seriously push back against this anti-pattern globally. Injecting C++ callback control from above is a sign of missing abstractions and should almost always be disallowed. The alternative is often to refactor multiple times until the right abstractions emerge. In other words, the transformations we add must be functional-style, statically return the pieces of IR (existing created or updated) that make sense for that transform. This is not something customizable, if you need more information then statically return more information: no backchannels through lambdas. If you need different behaviors, instead of injecting a dynamic mechanism through a lambda, write another transformation that takes more/different inputs and return more/different outputs. Refactor the reusable utility helpers to avoid copy-pasta. These transformations can then be plugged into patterns and transform dialect ops who can be responsible for the switch between different static behaviors.
Hard stop on more lambdas from above, I am happy to iterate with you on the IREE side to a good solution.
As a side note, softmax and other complex things fuse properly into scf.foreach_thread thanks to a redesigned iterative algorithm and better sets of input / outputs.
Not sure if relevant to your specific case but my gut feeling is that this is not a case that qualifies for a pass.
If I have to internalize that comment, and the places you have blocked this, all the places you mentioned are worklist based algorithms (either using pattern rewrite based fixed point solutions, or the worklist algorithm used here). In both the cases, the callbacks allow the way the worklist is built. Callbacks allow control externally without leaking too much of the implementation details. I think that is useful if done in specific/controlled ways. I am not a fan of hard red-lines. In this particular case, I can rewrite things to not use callbacks but it will have to leak too much of the implementation details to allow callers to put things together again. Thats the tradeoff IMO
As a side note, softmax and other complex things fuse properly into scf.foreach_thread thanks to a redesigned iterative algorithm and better sets of input / outputs.
Since you have put a hard stop to this, could point me to where this is. Since there is unilateral decision made to disallow callbacks, Ill try to conform to this. Also btw, not related to softmax (that already works), and not related to scf.foreach_thread cause it is not related to distribution.
Not sure if relevant to your specific case but my gut feeling is that this is not a case that qualifies for a pass.
I apologize for the language, I realize upon rereading that it was too harsh and didn't mean it this way, I could have posted in less of a hurry and worded things better.
As we discussed offline, I would love to be able to come up with a characterization where we are very careful with C++ callbacks and avoid things that amount to inversion of control and breaking the functional contract that has worked very well for the implementation of most transformations logic.
I do think this example can be written in a way that avoids this injection and I am happy to help iterate until we get there (it may be a long process and I am sorry for the drag this has been causing for a too long time already).
The piece I was referring to I think was actually adapted from an earlier version of yours, it is only used inside the transform dialect for now but if there is a need to extract as a free function we can:
https://sourcegraph.com/github.com/llvm/llvm-project/-/blob/mlir/lib/Dialect/Linalg/TransformOps/LinalgTransformOps.cpp?L504
My experience with this is that this can run in 2 modes: either use the list of ops to fuse incrementally and let the transformation figure things out, or call the transformation itself in a sequence controlled from outside, which allows more flexibility.
I could offer to try the example you have in mind through this and see whether something new appears that would force us to rethink the approach.
Thanks for that!
As we discussed offline, I would love to be able to come up with a characterization where we are very careful with C++ callbacks and avoid things that amount to inversion of control and breaking the functional contract that has worked very well for the implementation of most transformations logic.
I do think this example can be written in a way that avoids this injection and I am happy to help iterate until we get there (it may be a long process and I am sorry for the drag this has been causing for a too long time already).The piece I was referring to I think was actually adapted from an earlier version of yours, it is only used inside the transform dialect for now but if there is a need to extract as a free function we can:
https://sourcegraph.com/github.com/llvm/llvm-project/-/blob/mlir/lib/Dialect/Linalg/TransformOps/LinalgTransformOps.cpp?L504My experience with this is that this can run in 2 modes: either use the list of ops to fuse incrementally and let the transformation figure things out, or call the transformation itself in a sequence controlled from outside, which allows more flexibility.
I could offer to try the example you have in mind through this and see whether something new appears that would force us to rethink the approach.
Thanks. I think I understand. I will hold off on this patch, and send some NFC patches that basically do
- Have an API entry that does one step of the fusion, i.e. one producer -> tensor.extract_slice swap.
- Refactor the existing scf::tileConsumerAndFuseProducersGreedily to use this and then maybe have different API entry points that allow having a list of producers and tile + fuse all of them etc. that use the entry point added in (1).
Might not get to this soon (I am on vacation for a few days). Ill send out the stack once I have something working.
Why use here? SliceOp is an Operation*.