This is an archive of the discontinued LLVM Phabricator instance.

[mlir][Tensor] Add rewrites to extract slices through `tensor.collape_shape`
ClosedPublic

Authored by christopherbate on Jul 13 2022, 3:03 PM.
Tokens
"Like" token, awarded by chongxing.

Details

Summary

This change adds a set of utilities to replace the result of a
tensor.collapse_shape -> tensor.extract_slice chain with the
equivalent result formed by aggregating slices of the
tensor.collapse_shape source. In general, it is not possible to
commute extract_slice and collapse_shape if linearized dimensions
are sliced. The i-th dimension of the tensor.collapse_shape
result is a "linearized sliced dimension" if:

  1. Reassociation indices of tensor.collapse_shape in the i'th position is greater than size 1 (multiple dimensions of the input are collapsed)
  2. The i-th dimension is sliced by tensor.extract_slice.

We can work around this by stitching together the result of
tensor.extract_slice by iterating over any linearized sliced dimensions.
This is equivalent to "tiling" the linearized-and-sliced dimensions of
the tensor.collapse_shape operation in order to manifest the result
tile (the result of the tensor.extract_slice). The user of the
utilities must provide the mechanism to create the tiling (e.g. a loop).
In the tests, it is demonstrated how to apply the utilities using either
scf.for or scf.foreach_thread.

The below example illustrates the pattern using scf.for:

%0 = linalg.generic ... -> tensor<3x7x11x10xf32>
%1 = tensor.collapse_shape %0 [[0, 1, 2], [3]] : ... to tensor<341x10xf32>
%2 = tensor.extract_slice %1 [13, 0] [10, 10] [2, 1] : .... tensor<10x10xf32>

We can construct %2 by generating the following IR:

%dest = linalg.init_tensor() : tensor<10x10xf32>
%2 = scf.for %iv = %c0 to %c10 step %c1 iter_args(%arg0) -> tensor<10x10xf32> {
   // Step 1: Map this output idx (%iv) to a multi-index for the input (%3):
   %linear_index = affine.apply affine_map<(d0)[]->(d0*2 + 11)>(%iv)
   %3:3 = arith.delinearize_index %iv into (3, 7, 11)
   // Step 2: Extract the slice from the input
   %4 = tensor.extract_slice %0 [%3#0, %3#1, %3#2, 0] [1, 1, 1, 10] [1, 1, 1, 1] :
         tensor<3x7x11x10xf32> to tensor<1x1x1x10xf32>
   %5 = tensor.collapse_shape %4 [[0, 1, 2], [3]] :
         tensor<1x1x1x10xf32> into tensor<1x10xf32>
   // Step 3: Insert the slice into the destination
   %6 = tensor.insert_slice %5 into %arg0 [%iv, 0] [1, 10] [1, 1] :
         tensor<1x10xf32> into tensor<10x10xf32>
   scf.yield %6 : tensor<10x10xf32>
}

The pattern was discussed in the RFC here: https://discourse.llvm.org/t/rfc-tensor-extracting-slices-from-tensor-collapse-shape/64034

Diff Detail

Event Timeline

christopherbate requested review of this revision.Jul 13 2022, 3:03 PM
christopherbate edited the summary of this revision. (Show Details)Jul 13 2022, 3:07 PM
mravishankar requested changes to this revision.Jul 13 2022, 4:51 PM

High level comment, reshape-like operations should not really be part of tiling interface... They are really metadata operations and not compute operations. So I think a bit more context on the use case of this would be useful. Adding a tiling interface implementation for what should be just metadata operations leads to some unintended consequences (from experience having done exactly this for tensor.extract_slice and tensor.insert_slice during incubation in IREE).

This revision now requires changes to proceed.Jul 13 2022, 4:51 PM

High level comment, reshape-like operations should not really be part of tiling interface... They are really metadata operations and not compute operations. So I think a bit more context on the use case of this would be useful.

Probably the biggest reason to allow this sort of transformation as part of the normal tile-and-fuse flow is that it allows you to reduce multiple dimensions as a single dimension without blocking pulling producers into the loop nest.

I wrote this diff in two ways -- one being just adding it to tiling interface, the other adding a standalone rewrite to do the generateResultTileValue and adding a test pass, without hooking into TilingInterface. The first way ends up being less code and doesn't require extra test pass to demonstrate the end-to-end result.

I'll push up the other method if you think that's better.

Probably the biggest reason to allow this sort of transformation as part of the normal tile-and-fuse flow is that it allows you to reduce multiple dimensions as a single dimension without blocking pulling producers into the loop nest.

Totally agreed on this point. There was co-incidentally a similar discussion about this on IREE discord. Essentially you want to allow tile and fuse to work through reshapes. Currently the way tile + fuse is implemented is you use the tensor.extract_slice to find the tile of the producer that you need to compute in place. But that requires the producer to be an operation that implements the TilingInterface. If you tensor.extract_slice comes from a result of a reshape, then the fusion stops. This is probably what you are referring to as well. Instead, you can fold the reshape and tensor.extract_slice to get a tensor.extract_slice of the source of the reshape. Then the fusion proceed as normal. I think this folding pattern is the missing component. If we have that we can then figure out how to add it to the mix and get tile and fuse to work through reshapes....

Use a dedicated rewrite pattern to replace the tensor.extract_slice rather than using an implementation of TilingInterface for tensor.collapse_shape.

Totally agreed on this point. There was co-incidentally a similar discussion about this on IREE discord. Essentially you want to allow tile and fuse to work through reshapes. Currently the way tile + fuse is implemented is you use the tensor.extract_slice to find the tile of the producer that you need to compute in place. But that requires the producer to be an operation that implements the TilingInterface. If you tensor.extract_slice comes from a result of a reshape, then the fusion stops. This is probably what you are referring to as well. Instead, you can fold the reshape and tensor.extract_slice to get a tensor.extract_slice of the source of the reshape. Then the fusion proceed as normal. I think this folding pattern is the missing component.

I briefly checked out your discussion, and it appears that we are after the same thing.

If we have that we can then figure out how to add it to the mix and get tile and fuse to work through reshapes....

I updated the diff so that the pattern is implemented as a dedicated rewrite with a public method to directly produce the equivalent result of a tensor.collapse_shape -> tensor.extract_slice chain.

Maybe we need a SliceableInterface in addition to TilingInterface. Then it would be easier to hook into the tile and fuse flow if the idea is that only computation operations implement TilingInterface.

We would also need a control mechanism for tile and fuse to allow the caller to stop the greedy fusion early if desired. Were you already planning on adding a mechanism for that?

I updated the diff so that the pattern is implemented as a dedicated rewrite with a public method to directly produce the equivalent result of a tensor.collapse_shape -> tensor.extract_slice chain.Maybe we need a SliceableInterface in addition to TilingInterface. Then it would be easier to hook into the tile and fuse flow if the idea is that only computation operations implement TilingInterface. We would also need a control mechanism for tile and fuse to allow the caller to stop the greedy fusion early if desired. Were you already planning on adding a mechanism for that?

Thanks! I'll take a look. There is a lot to unpack here and I need to understand how this works a bit more. Maybe @nicolasvasilache might already have some ideas here.
With respect to SliceableInterface, its along the lines what Nicolas had suggested a while back to, but more in terms of a "SubSetInterface" where you can define how to extract/insert subset of data (not necessarily rectangular or contiguous). Not sure if anything has already been done on that.
For control, I think the greedy pattern is more for "demonstration". Any control mechanism will probably use the guts of things and be wrapped within the transform dialect. Its really hard IMO to provide a universal control mechanism, but one thing that can work is having a callback that allows callers to decide if a slice has to be used for fusion or not. I didnt add that cause I wasnt sure how it would interact with the transform dialect approach, but I have used such callbacks in a different context, and they did the job.

mlir/lib/Dialect/Tensor/Transforms/ExtractSliceFromReshape.cpp
27

Add support for using scf.foreach_thread instead of an scf.for loop nest.
Refactor the implementation to deal with added complexity of multiple loop types.

christopherbate retitled this revision from [mlir][Tensor] Add a TilingInterface implementation for `tensor.collapse_shape`. to [mlir][Tensor] Add rewrites to extract slices through `tensor.collape_shape`.Jul 14 2022, 9:51 PM
christopherbate marked an inline comment as done.
christopherbate edited the summary of this revision. (Show Details)Jul 18 2022, 12:25 PM
nicolasvasilache accepted this revision.Jul 19 2022, 7:42 AM

Very cool stuff, thanks @christopherbate this is something that has been missing for a long time!

mlir/lib/Dialect/Tensor/Transforms/ExtractSliceFromReshape.cpp
32

Can you move the helpers to include/mlir/Dialect/Utils/ReshapeOpsUtils.h for better visibility?

69

Probably worth it for this one to inspect the tensor dim and immediately produce IndexAttr(size).
Unfortunately, createOrFold always returns a Value atm.

140

nit: Plz reflow comment

141

nit: space after for

142

I think we can move to SmallVector<OpFoldResult> lbs, ubs, steps; thanks to recent additions that allow creating more folded expressions than before.

151

You now have makeComposedFoldedAffineApply that takes OpFoldResult and does the right thing under the hood.

196

Why the difference with the above ?
I personally prefer the mixed s0 + d0 * s1 form everywhere if possible.

226

I would also move this to the reshape utils for visibility, I anticipate this will become a more important use case over time.

239

nit: llvm::append_range(extractOffsets, llvm::map_range()) may be a bit nicer.

251

seemslike this is the impl you want for your getShapeDimSizes above and then you can just reuse here ?

279

Must the strides always be all 1s ?
Or should you pass them as an extra argument too?

288

I'll comment on your other PR, but I would either:

  • move this op to arith and keep the operands as they are, or
  • keep it in tensor but then take an actual tensor as the basis and abstract away the fact that you get the Dims as a lowering detail.

Unless you have a stronger argument for keeping the semantics as is that I don't see yet?

352

nit: values.

mravishankar resigned from this revision.Jul 19 2022, 9:24 PM

Left a few comments. Removing my blocker on this. Thanks!

mravishankar added inline comments.Jul 19 2022, 9:24 PM
mlir/lib/Dialect/Tensor/Transforms/ExtractSliceFromReshape.cpp
39

It might be better to make the return type Optional<llvm::SmallBitVector> or FailureOr<llvm::SmallBitVector> and return errors instead of (void)-ing them this way.
Also not clear to me what is the expected behavior if the shape are dynamic. I am expecting that it would return failure that needs to be propagated back up the stack.

111

Just out of curiosity, is there a solution space where we dont use scf.for or scf.for_each here? Its unfortunate coupling of these loop constructs with the transformation.

345

I dont think you need to cast it to the interface op. The method is implemented on tensor.extract_slice itself. You should be able to call it directly.

This revision is now accepted and ready to land.Jul 19 2022, 9:24 PM
christopherbate marked 13 inline comments as done.

Address comments

christopherbate marked an inline comment as done.Jul 20 2022, 8:00 PM
christopherbate added inline comments.
mlir/lib/Dialect/Tensor/Transforms/ExtractSliceFromReshape.cpp
111

So we need some sort of intermediate construct that handles the discontinuous iteration space. I just chose these loops because they seem commonly used.

I'm not sure whether you're asking

  1. whether the implementation can be improved to be more generic to the loop op kind or
  2. whether there's a better abstraction that can be used instead of a loop.

For 1, maybe. But it's probably the same amount of effort to add a method called buildWithMyLoopOp to the implementation class.

For 2, yes, I think there is more opportunity to fill missing abstractions here. If the linear dim size is 128 and you don't insert a loop, then instead you could directly insert an "unrolled" form of 128 iterations of what you originally put in the body. Of course, in this form it is too unwieldy. One option would be to add an op equivalent to a higher-order tensor.extract_slice and/or vector.transfer_read/write: it maps some iterations space to a set of slices. That may be easier to further transform versus the loop form.

mlir/lib/Dialect/Tensor/Transforms/ExtractSliceFromReshape.cpp
111

Yes, we will need new types and ops to handle the discontinuity and that intercepts gather/scatter/parallel_scatter and related abstractions.
This is much longer-term though.

it maps some iterations space to a set of slices -> yes the piecewise discontinuities are important to handle right, I am unclear yet what a good abstraction for this looks like but it will be a major focus or ours for the rest of the year.

christopherbate marked 3 inline comments as done.

Rebase

chongxing added a subscriber: chongxing.
mlir/lib/Dialect/Tensor/Transforms/ExtractSliceFromReshape.cpp
39

+1 to FailureOr anywhere we need to propagate errors.

christopherbate edited the summary of this revision. (Show Details)

Refactor to remove dependence on scf.for/scf.foreach_thread and caller provides iteration mechanism. Fix a couple bugs and improve formatting of tests.

christopherbate edited the summary of this revision. (Show Details)Aug 19 2022, 3:10 PM
nicolasvasilache accepted this revision.Aug 29 2022, 4:02 AM

Thanks!

mlir/include/mlir/Dialect/Tensor/Transforms/Transforms.h
39 ↗(On Diff #454114)

Could we please add some spacing and structuring to the comments?

... by the caller.

The class provides two methods:
  1. create: short description
  2. emitExtractSliceFromCollapseShapeBody: short description

Intended use case:
==============
The caller should first ... 

Then the caller should ...

Example:
=======
zzz
77 ↗(On Diff #454114)

This looks more like it belongs in a TransformUtils.h

77 ↗(On Diff #454114)

s/Builder/Helper

I find this class is a helper that exposes methods. Such methods take a builder to do X.
So the class itself is not a Builder, and we since that name has heavy implications in MLIR, we should avoid reusing it IMO.

114 ↗(On Diff #454114)

More structure to the doc here too plz:

... by:
  1. 
  2. 
  3.

The returned pair ...
114 ↗(On Diff #454114)

As in the other PR I commented on, I think inverting may be misleading, we are really propagating through the CollapseShapeOp.

In a followup, I imagine we could refactor and reuse some of the recent improvements introduced in: https://reviews.llvm.org/D128986 ?

mlir/include/mlir/Dialect/Utils/ReshapeOpsUtils.h
392

this class provides ?

394

The name ExtractShapeExtractSlice is hard to relate to.
Also, this not a Builder.

Something like CollapseLinearizationHelper ?

409

Better structuring of the docc here would also help readability (e.g. isolating pieces of equations better, having paragraphs)

409

invert -> fold plz

411

Example in IR would help to relate the proper pieces, the description is a bit too abstract to follow.

mlir/lib/Dialect/Tensor/Transforms/ExtractSliceFromReshape.cpp
95

With C++17 I think it becomes nicer to just:

SmallVector<Range> ranges;
ranges.reserve(...);
for (const auto & [o, s, t] : llvm::zip(...))
  ranges.push_back(Range{o, s, t});
151

see, it's a helper :)

mlir/test/lib/Dialect/Tensor/TestTensorTransforms.cpp
141

Fine for a test but if it want to be hoisted, this should prob be 1 pattern per loop op type.
Maybe add a TODO?

christopherbate marked 6 inline comments as done.

Address comments

christopherbate marked 4 inline comments as done.Aug 31 2022, 11:35 AM
christopherbate added inline comments.
mlir/include/mlir/Dialect/Tensor/Transforms/Transforms.h
77 ↗(On Diff #454114)

Moved to new file TransformUtils.h and renamed ExtractSliceFromCollapseShapeHelper.

114 ↗(On Diff #454114)

As in the other PR I commented on, I think inverting may be misleading, we are really propagating through the CollapseShapeOp.

I don't follow how inverting is misleading.

In x -> CollapseShape -> y -> ExtractSlice -> z , each of x,y,z are different Tensors with different spaces defined by their shapes. The CollapseShape and Extractslice define maps that map one coordinate to another. The effect of this function is to take coordinates in z, which are given by the tile coordinates / loop induction variables, and map them back to coordinates in x. I tried to make this point of view explicit in the implementation by using functions invertSliceIndexing and invertCollapseShapeIndexing.

In a followup, I imagine we could refactor and reuse some of the recent improvements introduced in: https://reviews.llvm.org/D128986 ?

There are some common aspects that could be factored out to unify these index space manipulation ideas. For example, extracting an element from a collapse-shape'd memref could be phrased as inverting the index transformation of the collapse, which is exactly what thy are doing with the lienarize/delinearize. They could also use the new AffineDelinearizeOp in order to simplify some of the code.

The two implementations would be the same except for one aspect: all the complexity in this use case comes from the fact that we have to carry around information about which dimensions are sliced and/or linearized in order to make the generated code efficient. In the linked diff, they don't have to worry about the effect of the slicing aspect, as that code is only dealing with loading individual elements. Even in the body of the loop generated by this function, we are not guaranteed to just load scalars. Rather, you need to also account for the potential of a) slicing a non-linearized dimension (in which case you don't need a loop dimension) or b) taking the entirety of a linearized dimension. You need this information at basically every point in this transformation. If we could discard the slicing aspect, then it would be greatly simplified into the special case of just loading scalar elements.

mlir/include/mlir/Dialect/Utils/ReshapeOpsUtils.h
394

Oops, that was supposed to be CollapseShapeExtractSlice. fixed.

409

I changed it, but please see my other comment on the rationale for using "invert" here.

mlir/lib/Dialect/Tensor/Transforms/ExtractSliceFromReshape.cpp
95

agreed, fixed

christopherbate marked 2 inline comments as done.Aug 31 2022, 11:38 AM
christopherbate added inline comments.
mlir/test/lib/Dialect/Tensor/TestTensorTransforms.cpp
141

I broke it into two patterns.

christopherbate marked an inline comment as done.

Rebase

nicolasvasilache added inline comments.
mlir/include/mlir/Dialect/Tensor/Transforms/Transforms.h
114 ↗(On Diff #454114)

I don't follow how inverting is misleading.

Ok, then I misunderstood this from the doc.
Could we improve the doc with better structuring and some IR example that makes the inversion pop up very clearly?

Expand the documentation for the main class that is added. I checked the result of the Doxygen build to make sure it looks correct with relevant hyperlinks and headings correctly generated.

Shortened some function/class names.

mlir/include/mlir/Dialect/Tensor/Transforms/Transforms.h
114 ↗(On Diff #454114)

Ok, I've expanded the IR example to be more complicated and added a step-by-step commentary that explains what is happening, including why I'm saying invert/inversion here.
Let me know if you still want me to adjust the wording.

I've checked the output of this comment block in the Doxygen build as well to ensure all the formatting shows up correctly.

nicolasvasilache accepted this revision.Sep 1 2022, 6:32 AM

Looks great, thanks for all the doc !

mlir/include/mlir/Dialect/Tensor/Transforms/TransformUtils.h
43

nit: "the the"

Fix typo in doc

christopherbate marked an inline comment as done.Sep 2 2022, 8:19 AM

Windows bot seems to be failing for unrelated reason

This revision was landed with ongoing or failed builds.Sep 2 2022, 10:29 AM
This revision was automatically updated to reflect the committed changes.

There is an issue of circular dependency here: you're introducing a dependency from Dialect/Utils/ to the ViewLikeInterface, but it already depends on Dialect/Utils

mehdi_amini added inline comments.Sep 2 2022, 4:34 PM
mlir/lib/Dialect/Tensor/Transforms/ExtractSliceFromReshape.cpp
17

On top of the cyclic dependency mentioned elsewhere, this introduces a dependency on Linalg which isn't defined in CMake as far as I can tell: this is breaking the bots.

Also I am not sure that it is conceptually correct to have lib/Dialect/Tensor to depend on Linalg?

I'll revert for now.

There is an issue of circular dependency here: you're introducing a dependency from Dialect/Utils/ to the ViewLikeInterface, but it already depends on Dialect/Utils

Thanks, I should have checked for that.

christopherbate reopened this revision.EditedSep 8 2022, 1:01 PM

Depends on https://reviews.llvm.org/D133523 to resolve circular dependency with ViewLikeInterface

This revision is now accepted and ready to land.Sep 8 2022, 1:01 PM

Rebase on D133523, resolve linalg dependency issue.

Verified Bazel build is working and added required Bazel build file changes