Add a transformation to tile reduction ops into a parallel operation
followed by a merge operation. This is equivalent to the existing
reduction splitting transformation, but it uses loops instead of
higher-dimensional linalg ops.
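To make the shape of the rewrite concrete, here is a scalar C++ analogy (illustrative only, not the generated IR): the reduction becomes an identity-initialized accumulator with tile-size many lanes, a tiled loop doing an elementwise (parallelizable) partial reduction into those lanes, and a final merge reduction over the lanes.

```cpp
// Scalar analogy of the transformation (illustrative only, not the patch's IR).
#include <numeric>
#include <vector>

float tiledSum(const std::vector<float> &x, size_t tileSize) {
  // Init with the reduction identity (0 for add).
  std::vector<float> partial(tileSize, 0.0f);
  // The tiled loop: each iteration does an elementwise, parallel-friendly
  // partial reduction of one tile into the accumulator lanes.
  for (size_t i = 0; i < x.size(); i += tileSize)
    for (size_t j = 0; j < tileSize && i + j < x.size(); ++j)
      partial[j] += x[i + j];
  // The merge step: a small final reduction over the partial results.
  return std::accumulate(partial.begin(), partial.end(), 0.0f);
}
```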
Wow! Awesome! I was just going to accept this, but realized I have one question on further reading. There seems to be an implicit assumption, shared across the different interface methods, that there is a single reduction dimension. Maybe it's better to be explicit and make the dimension (or list of dimensions) being tiled an explicit argument to the interface methods. That would remove the implicit assumption running through the different methods.
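For illustration, a rough sketch of what making the reduction dimensions explicit could look like for the three methods discussed later in this review; the return types and exact parameter lists here are assumptions, not the patch's actual signatures.

```cpp
// Hypothetical C++ signatures only; the real interface methods in the patch
// may differ in return types and parameters.
struct PartialReductionTilingMethods {
  // Create the tensor holding the identity/neutral values of the reduction.
  virtual Operation *generateInitialTensorForPartialReduction(
      OpBuilder &b, Location loc, ArrayRef<OpFoldResult> sizes,
      ArrayRef<int> reductionDims) = 0;
  // Emit the tiled op computing a partial reduction into `init`.
  virtual Operation *tileToPartialReduction(
      OpBuilder &b, Location loc, ValueRange init,
      ArrayRef<OpFoldResult> offsets, ArrayRef<OpFoldResult> sizes,
      ArrayRef<int> reductionDims) = 0;
  // Combine the partial results into the final reduced value.
  virtual Operation *mergeReductions(OpBuilder &b, Location loc,
                                     ValueRange partialResults,
                                     ArrayRef<int> reductionDims) = 0;
  virtual ~PartialReductionTilingMethods() = default;
};
```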
mlir/include/mlir/Interfaces/TilingInterface.td
163 | Is there a reason the interface method has to return a tensor? I think returning the identity element should be enough. I am not opposed to doing this, and it might actually be more general, but I am trying to understand the minimum requirement the interface should expect ops to implement.
mlir/lib/Dialect/Linalg/Transforms/TilingInterfaceImpl.cpp
273 | If I am reading this correctly, the setting of the identity element can be moved into the tileReductionUsingSCFForOp method (related to the comment above).
mlir/lib/Dialect/SCF/Transforms/TileUsingInterface.cpp
434 | I think SCF already has a dependence on the Arith dialect, so you can replace this with getValueOrCreateConstantIndexOp (see the sketch below).
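For reference, a minimal sketch of the suggested helper from the Arith dialect utilities (the header path varies across MLIR versions); it creates an arith.constant of index type only when the OpFoldResult holds an attribute, and otherwise returns the existing Value.

```cpp
// Sketch only; getValueOrCreateConstantIndexOp is the Arith utility suggested
// above, the surrounding helper name is illustrative.
#include "mlir/Dialect/Arith/Utils/Utils.h"

SmallVector<Value> toIndexValues(OpBuilder &b, Location loc,
                                 ArrayRef<OpFoldResult> ofrs) {
  SmallVector<Value> values;
  for (OpFoldResult ofr : ofrs)
    values.push_back(getValueOrCreateConstantIndexOp(b, loc, ofr));
  return values;
}
```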
Yes, I can do that. I was trying to make the interface more forward-looking, since I think we should be able to support multiple reductions at some point, but I agree it is a bit odd. I'll change the interfaces for now.
mlir/include/mlir/Interfaces/TilingInterface.td
163 | If we return the identity element, we still need to create the op generating the tensor, and that op depends on the op we are tiling (for linalg we would generate a linalg.fill, but not for other ops). Where would you do that? (A rough sketch of the linalg side follows below.)
mlir/lib/Dialect/Linalg/Transforms/TilingInterfaceImpl.cpp
273 | Getting the reduction op requires code that depends on linalg, so I'm not sure how we could do that without adding more interface functions.
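For context, a rough sketch of how the linalg side can build that tensor: create an empty tensor of the partial-result shape and fill it with the reduction's neutral element. Helper and variable names here are illustrative, not the exact code in this patch.

```cpp
// Illustrative sketch only. The neutral element (0 for add, 1 for mul, ...)
// is assumed to be provided as a Value; older MLIR would use linalg.init_tensor
// instead of tensor.empty.
#include "mlir/Dialect/Linalg/IR/Linalg.h"
#include "mlir/Dialect/Tensor/IR/Tensor.h"

static Value buildInitTensor(OpBuilder &b, Location loc,
                             ArrayRef<OpFoldResult> partialShape,
                             Type elementType, Value neutralElement) {
  // tensor.empty of the partial-reduction shape.
  Value empty = b.create<tensor::EmptyOp>(loc, partialShape, elementType);
  // linalg.fill writing the neutral element into every position.
  return b.create<linalg::FillOp>(loc, neutralElement, empty)->getResult(0);
}
```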
Flyby comment: it looks like this differential is essentially doing what is being done in https://github.com/iree-org/iree/pull/10537/files#diff-e105ec843efed97f5884ad26c85561c72967f0546029bd772335634f23553c31, which means I can remove that PR and use your new interface. You seem to have gotten rid of the control for the inserted index. Do you no longer need that for general-purpose reduction tiling?
Yes, this could be a potential way to replace that and generalize it by supporting dynamic shapes. The control for the index can be added to this if needed. Right now the code just makes a decision based on the reduction loop index, but for more flexibility there should indeed be a control.
I think this op should _not_ take care of multiple reduction dimensions. There is already a separate transform op or transformation (that Thomas wrote earlier) that combines multiple contiguous reduction dimensions, so we can use that to combine the sequence of innermost reductions into one. If the reduction dimension is not innermost, then there is no need to apply the technique, as parallel tiling will already create tiles that can be reduced in parallel.
In other words, I think this differential should be accepted as is, with comments (and checks) stating that only the innermost dimension may be reduced using this transformation.
mlir/include/mlir/Interfaces/TilingInterface.td
200 | Why are mergeReductions, tileToPartialReduction, and generateInitialTensorForPartialReduction part of the tiling interface? They work only for Linalg, don't they?
Move the reduction dimension detection into the common code and address other review comments.
Thanks Mahesh. As discussed, I moved the code detecting the reduction dimension into the common code, but creating the tensor is still done by the interface function.
mlir/include/mlir/Interfaces/TilingInterface.td
200 | The idea is that ops other than linalg could implement this interface in the future. For instance, we did have to do reduction splitting for the linalgext topK op, and that could be an op implementing this interface. Also, since we need to reuse helpers in the tiling code, this lets us do so without adding a dependency between the tiling code and linalg.
mlir/include/mlir/Interfaces/TilingInterface.td
200 | Is this really tiling the operation? Yes, at some level it moves the computation to working on tiles, but those tiles are not "normal" tiles in the sense of all other operations in this interface. Instead, these methods allow distributing a reduction over multiple threads. So I see how having these methods implemented allows applying this transformation to anything reduction-like, but they are not useful otherwise. Maybe have a StructuredReductionOp interface?
mlir/lib/Dialect/Linalg/Transforms/TilingInterfaceImpl.cpp
315 | Why is omitPartialTileCheck true here? Do we statically know that the dimension tiles evenly?
mlir/test/Dialect/Linalg/transform-tile-reduction.mlir
38 | Why is this valid? What happens if ARG0 does not have a size that is divisible by 5?
mlir/include/mlir/Interfaces/TilingInterface.td
189 | Could we make this an ArrayRef<unsigned>? Also, I have been led to believe that the recommended C++ usage is to use int types unless you are doing bit manipulation.
200 | I don't see a reason why this needs to be moved out of TilingInterface into any other interface. If an op doesn't support this kind of tiling, it just does not implement these methods. Also, I think there is a final state where we can combine tileToPartialReduction and getTiledImplementation, but that isn't clear yet, so a separate method is added for now. It is meant to handle tiling of loops that are reductions, which is fine. Having more interfaces just adds strange dependencies, or results in redundant interface methods across interfaces.
mlir/lib/Dialect/Linalg/Transforms/TilingInterfaceImpl.cpp
315 | The loop implementation already handles partial tiles; that is accounted for in the sizes. There is no need for an op implementation to account for it.
mlir/test/Dialect/Linalg/transform-tile-reduction.mlir
38 | This is a good point. I'd expect an affine.min to be generated by generateTiledLoopNest for this case.
mlir/lib/Dialect/Linalg/Transforms/TilingInterfaceImpl.cpp
315 | Yes, this is handled by generateTileLoopNest, so the sizes are already adjusted.
mlir/test/Dialect/Linalg/transform-tile-reduction.mlir
38 | Good catch, there was a bug in the size used for the insert. It is fixed now and the size is based on an affine.min as expected (a generic sketch of the clamping follows below).
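For readers following along, a generic sketch of the usual boundary-tile clamping pattern (not the patch's exact code): the size of the last, possibly partial, tile is min(tileSize, upperBound - iv), emitted as an affine.min.

```cpp
// Generic sketch: size = min(tileSize, ub - iv). The namespace of AffineMinOp
// and the exact helper used by the patch may differ across MLIR versions.
#include "mlir/Dialect/Affine/IR/AffineOps.h"

static Value createBoundedTileSize(OpBuilder &b, Location loc, Value iv,
                                   Value upperBound, Value tileSize) {
  AffineExpr d0, d1, s0;
  bindDims(b.getContext(), d0, d1);
  bindSymbols(b.getContext(), s0);
  // Map (iv, ub)[tileSize] -> (tileSize, ub - iv); affine.min keeps the smaller.
  AffineMap minMap = AffineMap::get(/*dimCount=*/2, /*symbolCount=*/1,
                                    {s0, d1 - d0}, b.getContext());
  return b.create<AffineMinOp>(loc, minMap, ValueRange{iv, upperBound, tileSize});
}
```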
Address more review comments.
mlir/include/mlir/Interfaces/TilingInterface.td
189 | Changed it to ArrayRef<int>.
mlir/include/mlir/Interfaces/TilingInterface.td
200 | I am worried that this change modifies TilingInterface substantially without a proper RFC, and it is not clear whether these methods should be part of the interface at all.
mlir/include/mlir/Interfaces/TilingInterface.td
200 | This is a strict, opt-in addition to the TilingInterface, so it does not impact anyone who is using TilingInterface today; I don't really think it needs an RFC. Can you give more substantial reasoning as to why this should not be part of the TilingInterface? It reuses the implementation of these methods in other operations, as well as parts of the implementation of tiling using the interface. That is a strong signal to me that it indeed belongs here. I don't see a reason to block this patch. If it turns out to muddle separation of concerns we can revisit that, but to me this is already layered correctly.
mlir/include/mlir/Interfaces/TilingInterface.td
200 | These methods are needed only for a subset of tileable operations. Adding it all here is similar to adding everything to the LinalgStructuredOp interface and then spending a lot of time extracting the DPS interface out of it, because Linalg ops are just a subset of DPS ops.
mlir/include/mlir/Interfaces/TilingInterface.td
200 | I don't think they are equivalent. DPS was deemed not a subset of LinalgStructuredOp but rather a superset; that's why it was pulled out. As you say, this is needed only for a subset of tileable operations. If an operation cannot support this way of tiling, it just doesn't implement those interface methods; then you cannot apply this transformation to those ops, and that is exactly how it should work.
mlir/include/mlir/Interfaces/TilingInterface.td
200 | The code is all moved into linalg. This requires exposing more SCF helpers.
I am not on board with exposing these utility functions. Layering this into TilingInterface is the right approach, as it avoids exposing these methods, which are really implementation details. Let's hold off until we can reach a consensus.
Based on the discussions, I moved the methods to a separate interface alongside TilingInterface. Please take another look.
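A hypothetical usage sketch of how a transform might check for both interfaces before tiling a reduction; the name of the second interface below is an assumption for illustration, not necessarily the one used in this patch.

```cpp
// Hypothetical sketch: the op must implement both the tiling interface and the
// new partial-reduction interface (name assumed here) for this transform to apply.
static LogicalResult checkInterfaces(Operation *op) {
  auto tilingOp = dyn_cast<TilingInterface>(op);
  auto reductionOp = dyn_cast<PartialReductionOpInterface>(op);
  if (!tilingOp || !reductionOp)
    return failure();
  return success();
}
```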
A bunch of suggestions for improving the code and its legibility, but nothing worth delaying this further.
Nice stuff, thank you @ThomasRaoux!
mlir/include/mlir/Dialect/Linalg/TransformOps/LinalgTransformOps.td
652 | Maybe document that your transform here overallocates for the sequential case (i.e. you could have only a tensor<5xf32> and iteratively reduce within the loop) but that this also works for the parallel case? Do you expect this to be configurable in the future, or to always use the larger allocation? Edit: I realize the regular tiling to scf.for should do the "smaller allocation" version. So now I am wondering whether we see an advantage doing it this way in the sequential case (i.e. is this related to @dcaballe's vectorization comments)?
mlir/include/mlir/Dialect/SCF/Transforms/TileUsingInterface.h
143 | Better docs here please.
153 | nit: merged
154 | super nit: column
160 | super nit: extra newline
mlir/include/mlir/Interfaces/TilingInterface.td
159 | I like this new interface; not all ops can support partial-reduction tiling without quite a bit more effort (e.g. gather/scatter/sort).
162 | "to tile reductions using partial reductions + merge"?
169 | nit: "methode" is French :) (here and below)
179 | nit: reductionDims (plural) since you take an ArrayRef?
189 | Our default type is int64_t everywhere; any concrete reason to use int? (I don't think the memory usage for a few ints justifies int.)
mlir/lib/Dialect/Linalg/Transforms/TilingInterfaceImpl.cpp
256 | OpBuilder::InsertionGuard guard(b); in anything that takes a builder reference, otherwise you are leaking insertion point changes across function boundaries, which creates bugs that are very hard to track (see the sketch after this list).
263 | int64_t everywhere, all the time, please (unless doing bit manipulations).
276 | I don't see a new map here?
292 | Please use ShapedType::isDynamic.
310 | Insertion guard.
317 | Can we add a proper AffineMap.h/cpp helper, AffineMap::injectResult(int64_t resultPos, AffineExpr newResult), and just use it directly: newMaps.back() = newMaps.back().injectResult(...);
318 | Some comment here re injecting the split dimension: (e0 .. ek .. en) -> (e0 .. ek, split_dim, ek+1 .. en)
327 | Better comments please: Step 1. ...
333 | Please add a top-of-the-function // TODO: SubsetExtractOpInterface
355 | Insertion guard.
365 | I have a personal preference for fewer loops / less code under loops: SmallVector<StringRef> reductionIteratorTypes(intermRank, getParallelIteratorTypeName()); reductionIteratorTypes[dimToMerge] = getReductionIteratorTypeName(); AffineMap outputMap = AffineMap::multiDimSomethingMap().dropResult(dimToMerge); // or the equivalent APIs
367 | unsigned -> int64_t everywhere please.
mlir/lib/Dialect/SCF/Transforms/TileUsingInterface.cpp
432 | Interesting, we don't have a good way of enforcing that an interface implies another.
446 | Can we just use an llvm::count and llvm::any_of/all_of? int64_t numReductionDims = llvm::count/count_if(lambda); if (numReductionDims != 1) error; if (llvm::any_of(lambda)) return b.notifyMatchFailure(op, "only reduction dimensions can have non zero tile sizes"); (see the sketch after this list)
467 | return b.notifyMatchFailure(...) for better debugging.
470 | Can you add a TODO at the func declaration that we should use ArrayRef<OpFoldResult> instead of ArrayRef<Value>?
475 | Something is fishy between this insertion point and the next: either the next one crashes or this one does nothing.
485 | You need to wrap this into something else to get a real OpFoldResult.
493 | Is the else ever legitimate?
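A compact sketch of the coding patterns suggested above (insertion guard, llvm::count_if/any_of, notifyMatchFailure); the function and variable names are hypothetical, and iterator types are shown as strings, which may differ by MLIR version.

```cpp
// Hypothetical helper, not the patch's code: keep insertion-point changes local,
// count reduction dims with count_if, and report failures via notifyMatchFailure.
#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/Sequence.h"

static LogicalResult checkReductionTileSizes(RewriterBase &b, Operation *op,
                                             ArrayRef<StringRef> iteratorTypes,
                                             ArrayRef<int64_t> tileSizes) {
  // Keep any insertion point changes local to this helper.
  OpBuilder::InsertionGuard guard(b);

  // Exactly one reduction dimension is supported.
  int64_t numReductionDims = llvm::count_if(
      iteratorTypes, [](StringRef it) { return it == "reduction"; });
  if (numReductionDims != 1)
    return b.notifyMatchFailure(op, "expected a single reduction dimension");

  // Only the reduction dimension may carry a non-zero tile size.
  // (Assumes tileSizes.size() == iteratorTypes.size().)
  if (llvm::any_of(llvm::seq<int64_t>(0, tileSizes.size()), [&](int64_t i) {
        return tileSizes[i] != 0 && iteratorTypes[i] != "reduction";
      }))
    return b.notifyMatchFailure(
        op, "only reduction dimensions can have non zero tile sizes");
  return success();
}
```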
Thanks for refactoring. Having this in a separate interface addresses my concerns about the partial interface. Did not look at the deep details but also do not want to block this.
Nicolas already seems to have reviewed in great detail! The overall structure looks fine to me. Thanks @ThomasRaoux for going through the different iterations to reach convergence.
mlir/include/mlir/Interfaces/TilingInterface.td
159 | Agreed! sort can actually support it, I think :)
Address review comments.
mlir/include/mlir/Dialect/Linalg/TransformOps/LinalgTransformOps.td
652 | I'm adding more doc. In general, the only case where we would want to allocate less is when the reduction size is smaller than the tile size. This could become an affine.min, but I'm not sure we would ever want that. We can iterate if needed.
mlir/lib/Dialect/Linalg/Transforms/TilingInterfaceImpl.cpp
256 | Even if I don't move the insertion point?
276 | Yes, the comment was outdated. Rewrote it.
mlir/lib/Dialect/SCF/Transforms/TileUsingInterface.cpp
432 | Makes sense, added a comment.
475 | Oops, yeah, removed this code.
493 | I don't think there is; I removed the if.