Introduce a new transformation on structured ops that tiles the iteration space
using two different tile sizes after splitting it into two parts. The tile
sizes are computed dynamically to be less than some target value and so that
both parts only contain full tiles, and may be required to be divisble by a
certain (constant) value. This provides an alternative to padding and peeling
when the iteration space is not perfectly divisble by the target tile size,
with better guarantees on memory footprint.
Details
- Reviewers
nicolasvasilache shabalin
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
High level comment, could we use TilingInterface for this. The advantage is that you don't need to couple this to the looping construct. I hope to purge the implicit coupling of scf within the implementation of tiling in Linalg. I found that separating the looping constructs from the tiling implementation makes things much simpler. It might help here as well
I looked at the interface and it seems that it would be more involved to implement this over the interface rather than here because of the exponentially expanding loop structure. So it would be preferable to me to have the actual tiling approach reviewed and exercised first, before eventually porting it to the interface in a separate commit.
I wonder if there is an opportunity to split this PR in a few smaller and more composable pieces.
In particular, I would have a use of a general mechanism that splits Linalg ops into smalller Linalg ops that are known to be multiples of X and Y along each dimension while still remaining in Linalg land.
This would compose naturally with existing tiling transforms, reducing the need to port to TilingInterface by reusing as much as possible and gaining tiling to the different types of loops.
Staying in Linalg land also allows say ? -> ?x8 + ?x7 which makes it possible to implement tiling as an N-D Linalg -> 2*N-D Linalg transformation (instead of the current N-D Linalg -> N-D loops + N-D Linalg).
This will also open opportunities for distribution and other fun higher-order stuff.
So please, consider generalizing :)
mlir/include/mlir/Dialect/Linalg/TransformOps/LinalgTransformOps.td | ||
---|---|---|
186 | should | |
192 | imperfectly | |
199 | The note deserves its own paragraph for highlighting I believe. | |
mlir/include/mlir/Dialect/Linalg/Transforms/Transforms.h | ||
142 | appearance | |
mlir/lib/Dialect/Linalg/Transforms/Tiling.cpp | ||
325 | Can we use affine_apply expressions that would make these quantities nicer to parse (and later compose) ? | |
411 | Same Q here, we could literally write something like: create<AffineApply>( {iv * highTileSize + lowTrip * lowTileSize}, ...) and they would render as such. | |
456 | emit ? oh my .. :) :p |
mlir/lib/Dialect/Linalg/TransformOps/LinalgTransformOps.cpp | ||
---|---|---|
448 | Reviewed this part, LGTM. |
mlir/lib/Dialect/Linalg/Transforms/Tiling.cpp | ||
---|---|---|
307 | computing. | |
310 | I would add an example that mentions a rewrite such as ? -> ?*8 + ?*7. | |
325 | Just to be sure that I don't misinterpret the computation: this is the classical "load balancing" trick right ? I.e. instead of decomposing 801 into 8*100 + 1, one would decompose in 8 * 94 + 7 * 7 ? The way this is described lends to suggest that it may not always be N or N-1 which would be different. |
mlir/lib/Dialect/Linalg/Transforms/Tiling.cpp | ||
---|---|---|
325 | Even simpler: 801 into 8*100 + 1, one would decompose in 8 * 99 + 9 * 1. |
This was implemented as a combination of structured.split and structured.multitile_sizes.
should