This is an archive of the discontinued LLVM Phabricator instance.

[mlir][linalg] Fusion on tensors.
AbandonedPublic

Authored by gysit on Aug 31 2021, 7:10 AM.

Details

Reviewers
nicolasvasilache
Summary

Add a new version of fusion on tensors that supports the following scenarios:

  • support fusion on all levels of the tile loop nest
  • fuse a producer result passed in via tile loop iteration arguments (update the tile loop iteration arguments)
  • supports only linalg operations on tensors
  • supports only scf::for
  • cannot add an output to the tile loop nest

The LinalgFusionOnTensors pass tries to fuse all producers of tiled ops.

Diff Detail

Event Timeline

gysit created this revision.Aug 31 2021, 7:10 AM
gysit requested review of this revision.Aug 31 2021, 7:10 AM

some early comments, will pick up again a bit later.

mlir/include/mlir/Dialect/Linalg/Passes.td
59

Can we keep the same name between the pass and the string? i.e. def LinalgFuseTensorOps.

60

Typo tesnors.

mlir/lib/Dialect/Linalg/Transforms/FusionOnTensors.cpp
30

consisting of exactly intermediate extract_slice and scf::for ops ?

Also, please mention that the use-def search stops as soon as a different op is found.

53

Add a TODO to support other types of enclosing ops (e.g. scf.if) and other types of loops ?
Since we are in value land, we should be able to handle more interesting cases?

161

We usually just go for T mlir::linalg::isProducerFusable(...) {...}

gysit updated this revision to Diff 369757.Aug 31 2021, 11:43 AM

Address comments.

gysit marked 5 inline comments as done.Aug 31 2021, 11:45 AM
mravishankar added inline comments.Aug 31 2021, 10:06 PM
mlir/include/mlir/Dialect/Linalg/Utils/Utils.h
164

Curious why we need this and the producer itself. I dont think we can split the producer and "fuse" individual OpResult if we have ops with multiple results

mlir/lib/Dialect/Linalg/Transforms/FusionOnTensors.cpp
55

would this not be easier to get with the getUseDefChain above?

174

Nit: fold this with the check below. AN error saying "expected linalg op" implies blockargument is not expected.

189

Third time traversing this. Its a style comment, so sort of a nit. You could compute the sliceOps and loops and bbArgs at the same time while traversing the use-def chain. The chain is small, so its not an efficiency thing, but more cognitive overhead.

mlir/test/Dialect/Linalg/fusion-on-tensors.mlir
37

Am I reading this right? The fill is coming into the k loop?

@nicolasvasilache @mravishankar thanks for the feedback so far. I already addressed some comments and will continue to do so.

mlir/include/mlir/Dialect/Linalg/Utils/Utils.h
164

The current version of fusion supports fusing producers with multiple outputs but introduces redundant computation.

Two follow up changes could make sense:

  • a canonicalization pattern that removes unused outputs from a linalg op (probably limited to generic ops)
  • adding the additional outputs of the cloned producer to the tile loop iteration arguments and replace all uses after the tile loop nest.
mlir/lib/Dialect/Linalg/Transforms/FusionOnTensors.cpp
55

I think if we fuse a consumer input that is not passed via iteration argument we still want to find the tile loops. They are needed to make the match between the tile loop dimensions and the extract slice ops.

189

I think one thing that introduces maybe a bit of complexity here is that code does not enforce an extract slice op for every tile loop (which helps making the unit tests more readable!). Also not that once we start fusing dependencies transitively there start to be tile loops that are actually not tiling the fused operation... I thus currently believe the choice of getting the loops first and then try to map the extract slice ops is better. But I will think about simplifying this code!

mlir/test/Dialect/Linalg/fusion-on-tensors.mlir
37

Yes since the fill is consumed by an input operand it is fine to fuse it into the k loop in this case. It is a trade-off though and at some point we may want to control if fusion shall introduce redundant computation and if yes how much.

gysit updated this revision to Diff 369871.Sep 1 2021, 12:09 AM

Address one more comment.

gysit marked an inline comment as done.Sep 1 2021, 12:10 AM
gysit updated this revision to Diff 369903.Sep 1 2021, 5:31 AM

Address comment and merge getTileLoops and getUseDefChain into one method.

gysit marked 2 inline comments as done.Sep 1 2021, 5:34 AM
gysit added inline comments.
mlir/lib/Dialect/Linalg/Transforms/FusionOnTensors.cpp
189

I collect the tile loops and the use def chain now in one method getTileLoopNest and I think it is an improvement!

Added a few more comments.

mlir/include/mlir/Dialect/Linalg/Passes.td
64

Is this the consumer op tile sizes ?
I think it is confusing to take about "loops dimensions" here as this is also dependent on the op semantics.

mlir/include/mlir/Dialect/Linalg/Utils/Utils.h
169

Add a note that is this currently limited to 1-1 output permutations to shape dimension but that this will be relaxed in the future?

182

I do not understand what tileLoopDims is and why it is needed.
From your usage I only see that it is initialized to null everywhere?

mlir/lib/Dialect/Linalg/Transforms/FusionOnTensors.cpp
38

I find the new API quite difficult to follow: there are hidden assumptions about when some entries are null but not others and how things are supposed to align that I have a hard time following.

Can we have a single vector with struct with a proper name that contains {scf::ForOp, tensor::ExtractSliceOp, BlockArgument} with all the properties spelled out explicitly with helper functions?
e.g. a strawman

struct AGoodName {
  void isPropertyX(int index) {
    return tileLoops[index] && sliceOps[index] && !bbArgs[index];
  }
  ...

  SmallVector<scf::ForOp> tileLoops;
  SmallVector<tensor::ExtractSliceOp> sliceOps;
  SmallVector<BlockArgument> bbArgs;
}
50

LoopLikeInterface has a "isDefinedOutsideOfLoop" method for this.

214

I'd lift that as a helper in the LinalgInterfaces, seems generally useful and a good place to centralize as we generalize.

225

this type of bookkeeping should be internal to the struct I mention above.

228

This should be a standalone documented helper function.

233

This feels like an assumption that comes from tiling invariants and is used at a distance without guarantees?
This isn't something we can generally rely on.
This seems to require a "tile and fuse" approach to the problem?

247

This is also internal bookkeeping of the struct that should have an intuitive name.

254

I'd rephrase this with an example and isolate in a helper function; e.g.

Consider the case of fusion of an output tensor:

%0 = producer(...)
%2 = consumer ins(...) outs(%0)

When consumer is tiled, %0 appears as a loop iter_arg, i.e.:

%0 = producer(...)
%2 = scf.for ... iter_args(%0) .. (%bbarg) {
  %t = tensor.extract_slice %bbarg[..]
  %c = consumer ins(...) (%bbarg)
  %r = tensor.insert_slice %t, %bbarg[...]
}

Fusing %0 into the loop requires updating iter_args(%0).
This transformation is only valid when xxx (the rest of your text + code).
...
TODO: instead of a check + failure, insert new iter_args each time a producer is fused into a consumer and fold away unused iter_args.
288

so, just take_front(sliceOpDepth) ?

291

This also seems like something that is only valid when you know the enclosing loop structure has been produced exactly by the tiling you expect and nothing has changed in the meantime?

gysit abandoned this revision.Sep 14 2021, 8:36 AM