Extend fusion on tensors to fuse producers greedily.
Diff Detail
- Repository: rG LLVM Github Monorepo
Event Timeline
This is the PR I'm looking for, thanks!
mlir/test/Dialect/Linalg/tile-and-fuse-sequence-on-tensors.mlir

Lines 71–77: Should we capture iter_args and extract the slice from iter_args? I think we are doing a destructive update, and everything should work on the iter_args. I.e., extract a slice from the iter_args, fill and matmul on the slice, insert the slice into the iter_args, and yield it. If the inputs are fill+matmul, the dest of the fill op is a slice of the iter_args. E.g.,

input:

```mlir
module {
  func @fuse_output(%arg0: tensor<?x?xf32>, %arg1: tensor<?x?xf32>) -> tensor<?x?xf32> {
    %cst = arith.constant 0.000000e+00 : f32
    %c0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    %0 = tensor.dim %arg0, %c0 : tensor<?x?xf32>
    %1 = tensor.dim %arg1, %c1 : tensor<?x?xf32>
    %2 = linalg.init_tensor [%0, %1] : tensor<?x?xf32>
    %3 = linalg.fill(%cst, %2) : f32, tensor<?x?xf32> -> tensor<?x?xf32>
    %4 = linalg.matmul ins(%arg0, %arg1 : tensor<?x?xf32>, tensor<?x?xf32>) outs(%3 : tensor<?x?xf32>) -> tensor<?x?xf32>
    return %4 : tensor<?x?xf32>
  }
}
```

output:

```mlir
...
%5 = scf.for %arg2 = %c0 to %1 step %c4 iter_args(%arg3 = %2) -> (tensor<?x?xf32>) {
  %6 = scf.for %arg4 = %c0 to %0 step %c5 iter_args(%arg5 = %arg3) -> (tensor<?x?xf32>) {
    %7 = affine.min #map0(%arg4)[%0]
    %8 = tensor.extract_slice %arg0[%arg4, 0] [%7, %4] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
    %9 = tensor.dim %arg1, %c0 : tensor<?x?xf32>
    %10 = affine.min #map1(%arg2)[%1]
    %11 = tensor.extract_slice %arg1[0, %arg2] [%9, %10] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
    %12 = tensor.extract_slice %arg5[%arg4, %arg2] [%7, %10] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
    %13 = tensor.dim %2, %c0 : tensor<?x?xf32>
    %14 = tensor.dim %2, %c1 : tensor<?x?xf32>
    %15 = affine.min #map2(%7, %arg4)[%13]
    %16 = affine.min #map2(%10, %arg2)[%14]
    %17 = linalg.fill(%cst, %12) : f32, tensor<?x?xf32> -> tensor<?x?xf32>
    %18 = linalg.matmul ins(%8, %11 : tensor<?x?xf32>, tensor<?x?xf32>) outs(%17 : tensor<?x?xf32>) -> tensor<?x?xf32>
    %19 = tensor.insert_slice %18 into %arg5[%arg4, %arg2] [%7, %10] [1, 1] : tensor<?x?xf32> into tensor<?x?xf32>
    scf.yield %19 : tensor<?x?xf32>
  }
  scf.yield %6 : tensor<?x?xf32>
}
...
```

where %12 is `tensor.extract_slice %arg5[%arg4, %arg2] [%7, %10] [1, 1]`.
mlir/lib/Dialect/Linalg/Transforms/FusionOnTensors.cpp

Lines 102–103: Comment not needed, since the code describes what it does.
Line 109: Maybe add an assertion for the assumption? Also, we don't need the comment once we have the assertion: `assert(map.isProjectedPermutation() && "expected projected permutation");`
Lines 157–158: [optional] I'd prefer

```cpp
int64_t tiledSliceDim = std::get<0>(it);
int64_t tiledProducerLoop = std::get<1>(it);
```
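For illustration, here is a minimal sketch of how the surrounding loop might read with the suggested names. The loop shape and the variable names `tiledSliceDims` and `tiledProducerLoops` are assumptions for this sketch, not the actual code:

```cpp
// Illustrative only: iterate over the zipped slice/loop dimension pairs and
// give both tuple elements descriptive names instead of inlining std::get.
for (const auto &it : llvm::zip(tiledSliceDims, tiledProducerLoops)) {
  int64_t tiledSliceDim = std::get<0>(it);
  int64_t tiledProducerLoop = std::get<1>(it);
  // ... use tiledSliceDim / tiledProducerLoop ...
}
```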
mlir/test/Dialect/Linalg/tile-and-fuse-sequence-on-tensors.mlir

Lines 71–77: That is right, and the code actually updates the iteration argument. I just did not add this to the test since I wanted to match the new functionality (fusing a chain of operations and considering the right index dimensions). I will try to improve the test to make things clearer. There is one limitation that is on purpose, though. This revision always tiles the consumer and adds its outputs to the iteration arguments. However, it never adds additional iteration arguments while fusing producers. If a result of a producer is used outside of the fused tile loop nest, the original producer will thus remain in the code for now. We may handle this in a follow-up revision if there is a use case.
mlir/lib/Dialect/Linalg/Transforms/FusionOnTensors.cpp

Line 109: I updated the comments and made the assertion more concise. In fact, it is sufficient if there is a one-to-one mapping from slice dimensions to loop dimensions. We do not care about dimensions that are not sliced.
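As a rough sketch of what such a one-to-one check could look like (the variables `indexingMap` and `tiledSliceDims` are hypothetical names for this illustration, and the actual assertion in the patch may be more concise):

```cpp
// Illustrative sketch: every tiled slice dimension must map to a distinct
// loop dimension; dimensions that are not sliced are ignored.
llvm::SmallDenseSet<unsigned> seenLoopDims;
for (int64_t tiledSliceDim : tiledSliceDims) {
  auto dimExpr = indexingMap.getResult(tiledSliceDim).dyn_cast<AffineDimExpr>();
  assert(dimExpr && seenLoopDims.insert(dimExpr.getPosition()).second &&
         "expected a one-to-one mapping from slice to loop dimensions");
}
```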
mlir/test/Dialect/Linalg/tile-and-fuse-sequence-on-tensors.mlir

Lines 71–77: Thanks for the explanation. I think we want to add additional iteration arguments. Let's do it in a follow-up revision.
Sorry for the super long delay... it fell off my radar. I am happy that we have fancier e2e use cases to exercise this, and new users chiming in :)
Thanks also @hanchung for your review.
Minor cosmetic comments but LGTM!
mlir/lib/Dialect/Linalg/Transforms/FusionOnTensors.cpp

Line 98: Better doc comments + a minimal example would be useful (no need to fully spell out the IR, pseudo-IR with the relevant info is what we want). Something like:

Given a list of `tiledSliceDims` that represent ..., return the indices of ...

Example:
  for i
    for j
      for k
        something
getTiledProducerLoops(...) returns ...
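For illustration only, a doc comment along those lines might look like the sketch below. The described semantics are paraphrased from this review thread (tiled slice dimensions mapping to tiled producer loops), so the exact wording is an assumption and would need to be checked against the implementation:

```cpp
/// Given the dimensions of the producer result slice that are tiled by the
/// tile loop nest (`tiledSliceDims`), returns the producer loop indices that
/// correspond to those slice dimensions.
///
/// Example (pseudo-IR):
///   for %i
///     for %j
///       %slice = tensor.extract_slice %producer_result[%i, %j] ...
/// If slice dimension 0 maps to producer loop d1 and slice dimension 1 maps
/// to producer loop d0, the function returns {1, 0}.
```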
Line 100: static?
Lines 102–103: Always bias towards more and better comments; none of this is trivial to verbalise -> 'get the indexing map of the output operand in the producer op, that matches the opResult'.
Line 108: There should be an existing AffineMap::xxxinversexxx function that you can reuse or extend (or create if nothing fits) instead of rolling your own.
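One existing candidate might be the free function `inversePermutation` from mlir/IR/AffineMap.h; naming it here is my suggestion, not the reviewer's, and whether it handles the projected-permutation case needed in this code would have to be verified. A minimal sketch:

```cpp
// Illustrative sketch: invert the permutation indexing map with an existing
// helper to go from slice dimensions back to producer loop dimensions,
// instead of building the mapping by hand.
AffineMap inverseMap = inversePermutation(indexingMap);
```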
Lines 198–199: Validity conditions change but I see no doc change.
Line 336: I'd even check that they are all in the same block; you never know what will happen in the future.
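A minimal sketch of such an extra check, with hypothetical variable names (`consumerOp`, `candidateOps`) and assuming the usual MLIR/LLVM headers are available:

```cpp
// Illustrative: verify all candidate ops live in the same block as the
// consumer, so future changes cannot silently violate this assumption.
Block *block = consumerOp->getBlock();
assert(llvm::all_of(candidateOps,
                    [&](Operation *op) { return op->getBlock() == block; }) &&
       "expected all fusion candidates in the same block");
```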
Line 354: I'd append Index/Indices to all these variable names to be super explicit.
Line 360: loop *indices*.
Line 421: `candidates.pop_back_val()` and drop the separate `pop_back()` call.
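For illustration, the suggested pattern (the `candidates` name comes from the comment; the element type and surrounding code are assumed):

```cpp
// SmallVector::pop_back_val() removes and returns the last element in one
// call, replacing a separate back() read followed by pop_back().
auto candidate = candidates.pop_back_val();
```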
mlir/test/Dialect/Linalg/tile-and-fuse-sequence-on-tensors.mlir

Lines 71–77: You shouldn't need anything new in the compiler for this. There is something interesting to be done related to iter_args, but it is quite a bit more complex and I think unnecessary atm.