This is an archive of the discontinued LLVM Phabricator instance.

[mlir][TilingInterface] Add mode to tile + fuse that allows yielding the replacements for fused producers.
AbandonedPublic

Authored by mravishankar on Sep 8 2022, 12:13 AM.

Details

Summary

Tile and fuse using TilingInterface allows fusing producers with
consumers at tile granularity, i.e. using a tiled implementation of
the producer to compute the tile needed for producer a tile of the
consumer. Current implementation of tile and fuse only yields the
value to replace the consumer. In some cases it is also useful to have
the loop nest yield a value to replace the tiled + fused producer.
This is only valid to do in cases where the producer is not computed
in a redundant fashion, i.e. after tiling the innermost loop body
produces a disjoint portion of the producer. This patch modifies the
implementation of tile and fuse to support this use case.

Currently there is no way to know automatically if there are redundant
computations introduced while fusing the producer. So a new option is
added which allows the caller to assert that this is indeed the case,
i.e. the tile sizes selected for the consumer are such that this
assumption can be made. Under this option, the generated tiled loop
nest also yeilds a value that can be used to replace the untiled
producer.

Diff Detail

Event Timeline

mravishankar created this revision.Sep 8 2022, 12:13 AM
mravishankar requested review of this revision.Sep 8 2022, 12:13 AM
nicolasvasilache requested changes to this revision.Sep 8 2022, 3:26 AM

Putting a blocker until I had time to review.
At first sight this looks even more complicated than https://reviews.llvm.org/D132085 for which the consensus was to break into small independent pieces, so without going into details yet this raises a few red flags for me.

This revision now requires changes to proceed.Sep 8 2022, 3:26 AM
mlir/include/mlir/Dialect/SCF/Transforms/TileUsingInterface.h
67

Quick fly-by comment.

Genereally we want to go towards SubsetInsert/Extract interfaces, committing more APIs to offsets + sizes is not where we want to go IMO.

Putting a blocker until I had time to review.
At first sight this looks even more complicated than https://reviews.llvm.org/D132085 for which the consensus was to break into small independent pieces, so without going into details yet this raises a few red flags for me.

A lot of it is refactoring, and happy to iterate, but AFAICS it is definitely much less complicated than the earlier attempt. I'll leave some high level pointers here hoping it helps in reviewing

  1. Most the logic is done by three methods
    • tileConsumer which generates the tile loop nest and tiles the consumer op. This is mostly refactoring.
    • fuseProducer which tiles and fuses a producer with all currently tiled and fused operations. While the implementation itself is done using swap of op -> tensor.extract_slice pattern, but refactoring the logic into using an "operation centric" view makes it much easier to handle the case where the producer values also needs to be produced by the tiled loop.
    • yieldTiledValues which modifies loop nest to yield values from within the tiled loop body. THe key here is that neither tileConsumer nor fuseProducer does this, i.e. after each of those calls the scf.for nest created does not yield anything yet. The reason being having/manipulating iter_args is extremely involved, and hard to reason about. Instead the core transformations just create the loop nest and body, and this method orchestrates the yield. This separation makes sense to me since the use of iter_args etc. is very tied to use of scf.for. So while the first two methods are not really tied to use of scf.for (any looping construct would work), this method manages the complications with iter_args.

The cadence of the transformation is fairly easy to follow (see the returningMatchAndRewrite methods for the tiling pattern and tile + fuse pattern). The rest is just details. I am still working out a few kinks in those details, but the generated code matches what I would expect. So in my mind this decomposition makes sense. I am happy to iterate further. Each of these methods have a lot of detail, but "what they do" is fairly well defined. Note that the part of yielding replacements for the fused producer is all gaurded by a flag. The rest of the changes are just NFC. I am happy to stress test it and see where it leads, drop it if there is a better way to do it. Personally I am confident this approach is more maintainable. This is my second attempt at this :) , and is a real blocker at this point (cant kick this can further).

Remove dropping of dead yields from tiled loop. That needs to be revisited later.

mravishankar abandoned this revision.Sep 18 2022, 1:12 PM

Abandoning this one to try a different approach.