This is an archive of the discontinued LLVM Phabricator instance.

[mlir][linalg] Add support for transitive fusion.
ClosedPublic

Authored by gysit on Sep 22 2021, 9:40 AM.

Details

Summary

Extend fusion on tensors to fuse producers greedily.

Diff Detail

Event Timeline

gysit created this revision.Sep 22 2021, 9:40 AM
gysit requested review of this revision.Sep 22 2021, 9:40 AM
gysit updated this revision to Diff 374271.Sep 22 2021, 9:45 AM

Remove rebasing artifact.

gysit updated this revision to Diff 379659.Oct 14 2021, 4:16 AM

Rebase.

This is the PR I'm looking for, thanks!

mlir/test/Dialect/Linalg/tile-and-fuse-sequence-on-tensors.mlir
71–77

Should we capture iter_args and extract slice from iter_args? I think we are doing destructive update, and everything should work on the iter_args. I.e., extract slice from iter_args, fill and matmul on the slice, insert the slice into iter_args and yield it.

If the inputs are fill+matmul, the dest of fill op is a slice of iter_args. E.g.,

input:

module  {
  func @fuse_output(%arg0: tensor<?x?xf32>, %arg1: tensor<?x?xf32>) -> tensor<?x?xf32> {
    %cst = arith.constant 0.000000e+00 : f32
    %c0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    %0 = tensor.dim %arg0, %c0 : tensor<?x?xf32>
    %1 = tensor.dim %arg1, %c1 : tensor<?x?xf32>
    %2 = linalg.init_tensor [%0, %1] : tensor<?x?xf32>
    %3 = linalg.fill(%cst, %2) : f32, tensor<?x?xf32> -> tensor<?x?xf32>
    %4 = linalg.matmul ins(%arg0, %arg1 : tensor<?x?xf32>, tensor<?x?xf32>) outs(%3 : tensor<?x?xf32>) -> tensor<?x?xf32>
    return %4 : tensor<?x?xf32>
  }
}

output:

...
    %5 = scf.for %arg2 = %c0 to %1 step %c4 iter_args(%arg3 = %2) -> (tensor<?x?xf32>) {
      %6 = scf.for %arg4 = %c0 to %0 step %c5 iter_args(%arg5 = %arg3) -> (tensor<?x?xf32>) {
        %7 = affine.min #map0(%arg4)[%0]
        %8 = tensor.extract_slice %arg0[%arg4, 0] [%7, %4] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
        %9 = tensor.dim %arg1, %c0 : tensor<?x?xf32>
        %10 = affine.min #map1(%arg2)[%1]
        %11 = tensor.extract_slice %arg1[0, %arg2] [%9, %10] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
        %12 = tensor.extract_slice %arg5[%arg4, %arg2] [%7, %10] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
        %13 = tensor.dim %2, %c0 : tensor<?x?xf32>
        %14 = tensor.dim %2, %c1 : tensor<?x?xf32>
        %15 = affine.min #map2(%7, %arg4)[%13]
        %16 = affine.min #map2(%10, %arg2)[%14]
        %17 = linalg.fill(%cst, %12) : f32, tensor<?x?xf32> -> tensor<?x?xf32>
        %18 = linalg.matmul ins(%8, %11 : tensor<?x?xf32>, tensor<?x?xf32>) outs(%17 : tensor<?x?xf32>) -> tensor<?x?xf32>
        %19 = tensor.insert_slice %18 into %arg5[%arg4, %arg2] [%7, %10] [1, 1] : tensor<?x?xf32> into tensor<?x?xf32>
        scf.yield %19 : tensor<?x?xf32>
      }
      scf.yield %6 : tensor<?x?xf32>
    }
...

where %12 is tensor.extract_slice %arg5[%arg4, %arg2] [%7, %10] [1, 1]

hanchung added inline comments.Nov 1 2021, 5:16 PM
mlir/lib/Dialect/Linalg/Transforms/FusionOnTensors.cpp
59–60

comment not needed, since the code describes what it does.

66

Maybe add an assertion for the assumption? Also, don't need the comment when we have the assertion.

assert(map.isProjectedPermutation() && "expected projected permutation");

139–140

[optional] I'd prefer

int64_t tiledSliceDim = std::get<0>(it);
int64_t tiledProducerLoop = std::get<1>(it);
gysit added inline comments.Nov 3 2021, 1:40 AM
mlir/test/Dialect/Linalg/tile-and-fuse-sequence-on-tensors.mlir
71–77

That is right and the code actually updates the iteration argument. I did just not add this to the test since I wanted to match the new stuff (fusing a chain of operations and considering the right index dimensions). I will try to improve the test to make things more clear.

There is one limitation that is on purpose though. This revision always tiles the consumer and adds its outputs to the iteration arguments. However, it never adds additional iteration arguments while fusing producers. If a result of a producer is used outside of the fused tile loop nest the original producer will thus remain in the code for now. We may handle this use case in a follow up revision if there is a use case.

gysit updated this revision to Diff 384391.Nov 3 2021, 4:00 AM
gysit marked 2 inline comments as done.

Address comments.

gysit marked 2 inline comments as done.Nov 3 2021, 4:03 AM
gysit added inline comments.
mlir/lib/Dialect/Linalg/Transforms/FusionOnTensors.cpp
66

I updated the comments and made the assertion more concise. In fact, it is sufficient if there is a one-to-one mapping from slice dimension to loop dimension. We do not care about dimensions that are not sliced.

hanchung accepted this revision.Nov 3 2021, 10:08 AM
hanchung added inline comments.
mlir/test/Dialect/Linalg/tile-and-fuse-sequence-on-tensors.mlir
71–77

Thanks for the explanation. I think we want to add additional iteration arguments. Let's do it in a follow up revision.

This revision is now accepted and ready to land.Nov 3 2021, 10:08 AM
nicolasvasilache accepted this revision.Nov 4 2021, 6:24 AM

Sorry for the super long delay .. it fell off my radar and I am happy that we have fancier e2e use cases to exercise this, and new users chiming in :)
Thanks also @hanchung for your review.

Minor cosmetic comments but LGTM!

mlir/lib/Dialect/Linalg/Transforms/FusionOnTensors.cpp
55

Better doc comments + minimal example would be useful (no need to fully spell out the IR, pseudo-IR with the relevant info is what we want).

Given a list of `tiledSliceDims` that represent ... return the indices of ...

Example:
for i
  for j
    for k
      something

getTiledProducerLoops( ... ) returns ...
57

static?

59–60

Always bias towards more and better comments, none of this is trivial to verbalise -> 'get the indexing map of the output operand in the producer op, that matches the opResult '

65

there should be an existing AffineMap::xxxinversexxx function that you can reuse or extend (or create if nothing fits) instead of rolling your own.

180–181

validity conditions change but I see no doc change.
I think all the cases required for validity should be enumerated in the proper place in the doc and kept up to date.

318

I'd even check that they are all in the same block, you never know what will happen in the future.

337

I'd append Index/Indices to all these vairable names to be super explicit.
I keep on expecting to see a Value coming from a DimOp here ..

342

loop *indices*.

402

candidates.pop_back_val() and drop the separate pop_back() call

mlir/test/Dialect/Linalg/tile-and-fuse-sequence-on-tensors.mlir
71–77

You shouldn't need anything new in the compiler for this.
You problem is that init_tensor should be created above and passed as an extra argument.
This is closer to a PL / compiler co-design that I replied to internally.

There is something interesting to be done related to iter_args but it is quite more complex and I think unnecessary atm.

gysit updated this revision to Diff 384769.Nov 4 2021, 8:46 AM
gysit marked 10 inline comments as done.

Address comments.

gysit added a comment.Nov 4 2021, 8:47 AM

mark comments as done.

This revision was landed with ongoing or failed builds.Nov 4 2021, 9:25 AM
This revision was automatically updated to reflect the committed changes.