Page MenuHomePhabricator

[mlir] Add ComprehensiveBufferize pass for function and modules (step 1/n)
ClosedPublic

Authored by nicolasvasilache on May 1 2021, 9:48 AM.

Details

Summary

This is the first step towards upstreaming comprehensive bufferization following the
discourse post: https://llvm.discourse.group/t/rfc-linalg-on-tensors-update-and-comprehensive-bufferization-rfc/3373/6.

This first commit introduces a basic pass for bufferizing within function boundaries,
assuming that the inplaceable function boundaries have been marked as such.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
herhut added inline comments.May 11 2021, 3:46 AM
mlir/lib/Transforms/ComprehensiveBufferization.cpp
54 ↗(On Diff #342159)

Is this legal IR? Is it not a use after free?

61 ↗(On Diff #342159)

So this assumes that the prelude with the allocation can always be hoisted to the caller. Is this correct (just making sure I understand right)?

129 ↗(On Diff #342159)

What does this do?

303 ↗(On Diff #342159)

Is the problem here that every user creates its own alloc (and hence the pattern to recognize to hoist these out to the caller gets messy) or that you have multiple copies?

Ultimately, you want to hoist the alloc and copy out, but only if they appear on all paths through the function, right? As a post optimization this would be hoisting copies + allocs. How do you envision this here? Collect this information before the bufferization in form of attributes?

358 ↗(On Diff #342159)

How do you envision this to compose? This is the equivalent of bufferization patterns for other dialects, right?

400 ↗(On Diff #342159)

So you assume that there is a single block for the entire computation that you are allocation here? Or at least that no allocated memory escapes the block?

Is this the reason for the tensor.load in the example comment? So cross-block flow is expected to be handled outside of this transformation?

481 ↗(On Diff #342159)

I think it is the wrong way round, this should communicate the updates tmpSliceRef back to the reference parameter if the match succeeded.

684 ↗(On Diff #342159)

Why is this different form the code below?

745 ↗(On Diff #342159)

Can we lift this to an arbitrary region with a single block and terminator? That would allow the use in other contexts, too.

nicolasvasilache marked 29 inline comments as done.

Address reviews.

thanks for reviewing despite the WIP state!

mlir/lib/Transforms/ComprehensiveBufferization.cpp
54 ↗(On Diff #342159)

It is illegal IR and if it cannot be legalized by hoisting, the pass will need to fail.

61 ↗(On Diff #342159)

It is meant to support the cases where the hoisting can happen.
Cases where the hoisting can't happen are unsupported by this pass at this time.

129 ↗(On Diff #342159)

nothing atm

159 ↗(On Diff #342159)

It is a real condition: getTiedOpResult may return null as it is perfectly fine that certain op operands do not map to results.

303 ↗(On Diff #342159)

The latter (i.e. you have multiple copies). When no interfering reads occur, we could reuse the same buffer but the current impl does not perform the optimization atm.

358 ↗(On Diff #342159)

not too sure how to parse this sentence, I just refactored as a DimOp bufferization pattern and deleted this.
This was a remnant from experimenting with other traversal orders.

400 ↗(On Diff #342159)

Yes, it is a best effort pass and branchy control-flow is generally supported.
In particular this pass has no intention of generalizing to the graph level and may just fail.
Added some more doc at the top-level to make it more explicit.

481 ↗(On Diff #342159)

yes, thanks!

565 ↗(On Diff #342159)

older cruft from when I used a backwardSlice, thanks!

659 ↗(On Diff #342159)

I feel this is premature, this pass is not meant to generalize to an open set of ops atm and I don't know that there is such a generic interface that makes sense for other ops.
Note that this is a "non-conversion" + inplace-aware specialization of the linalg bufferization pattern used in: https://github.com/llvm/llvm-project/blob/main/mlir/lib/Dialect/Linalg/Transforms/Bufferize.cpp#L191.
There could technically be refactorings that allow BlockAndValueMapping-based code to be reused by conversion patterns but it feels premature here.

684 ↗(On Diff #342159)

cruft from refactorings, thanks!

745 ↗(On Diff #342159)

I will think about it as I flush my stack and try as a followup, I have multiple refactorings in flight and the generalization should be separable.

779 ↗(On Diff #342159)

in my other repo :)

798 ↗(On Diff #342159)

right, the original pass was a full Module pass, I'll reinsert at the cross function boundary when it makes sense.

Hmm phab. did not like the file moving, a few unaddressed comments re interfaces dropped.

Add InplaceInterface.

Thanks for the replies.

mlir/lib/Transforms/ComprehensiveBufferization.cpp
303 ↗(On Diff #342159)

Still don't get this.

if you have a non-reusable input argument you create a copy and then mutate the copy, correct? I don't see how a single copy fits in here? If you have multiple cases of in place updates, you also need multiple copies so that you can materialize the updated value somewhere.

If the updates lie on different control flow paths, having a copy per path is not worse than a single shared copy (except for code size maybe). But that is not supported yet anyway, right?

358 ↗(On Diff #342159)

That answers my question. I was wondering how to enable bufferizing other operations and whether we would need more static functions for those. If they are patterns that are provided by the context, that should work.

nicolasvasilache marked 4 inline comments as done.May 12 2021, 5:50 AM
nicolasvasilache added inline comments.
mlir/lib/Transforms/ComprehensiveBufferization.cpp
303 ↗(On Diff #342159)

A single control flow path is supported so that part is out of the equation.

Imagine:

func @foo(%A: tensor<...>) {
  %B = op1(%A)
  op2(%B) // some print.
  %C = op3(%A)
  return %C
}

%A will bufferize to a new alloc %bA; after op2 we are done with the uses of the value %B.
If op1 and op3 were inpleaceable, we could reuse %bA as the buffer for %C too but atm this optimization is not supported and a new alloc would trigger for the use of %A in op2.

472 ↗(On Diff #342159)

it's the same as the one needed to drop getTiedOpResult for OpOperand.

nicolasvasilache marked 2 inline comments as done.

Address more review.

nicolasvasilache retitled this revision from [WIP][mlir] Add ComprehensiveBufferize pass for function and modules (step 1/n) to [mlir] Add ComprehensiveBufferize pass for function and modules (step 1/n).May 12 2021, 5:54 AM
nicolasvasilache edited the summary of this revision. (Show Details)
nicolasvasilache edited the summary of this revision. (Show Details)
nicolasvasilache marked an inline comment as done.May 12 2021, 6:00 AM
nicolasvasilache added inline comments.
mlir/lib/Transforms/ComprehensiveBufferization.cpp
164 ↗(On Diff #342159)

I'd like to keep this for a followup (and ideally farm out as an intro task), if you don't mind?

nicolasvasilache marked an inline comment as done.
nicolasvasilache edited the summary of this revision. (Show Details)

Add cmake.

ezhulenev added inline comments.
mlir/lib/Dialect/Linalg/Transforms/ComprehensiveBufferization.cpp
62 ↗(On Diff #344800)

Is it legal to memref.tensor_load after %2 was deallocated? Or it is intentionally an invalid intermediate IR that will be gone after bufferization?

nicolasvasilache marked an inline comment as done.May 12 2021, 11:35 AM
nicolasvasilache added inline comments.
mlir/lib/Dialect/Linalg/Transforms/ComprehensiveBufferization.cpp
62 ↗(On Diff #344800)

The latter, there will be a cross-"function/call" analysis that will say such alloc/dealloc pairs must be lifted outside the function at each call site and become new function operands (see comments l67-73)
The pass will fail if this is not feasible for some reason.

rriddle added inline comments.May 12 2021, 11:38 AM
mlir/lib/Dialect/Linalg/Transforms/ComprehensiveBufferization.cpp
726–727 ↗(On Diff #344800)

Did you mean for one of these to be getResultTypes?

nicolasvasilache marked 2 inline comments as done.

Address comment.

mlir/lib/Dialect/Linalg/Transforms/ComprehensiveBufferization.cpp
726–727 ↗(On Diff #344800)

yes, thanks for catching this!

silvas added inline comments.May 12 2021, 5:15 PM
mlir/include/mlir/Interfaces/InplaceInterface.td
21 ↗(On Diff #344894)

I know some folks have asked for this interface, but I have to disagree here. I don't think this interface is a good idea, because it doesn't really have any semantics independent of a particular bufferization scheme, and in particular the exact buffer op sequence that a tensor op gets lowered to (that is, it's really a way for a pass to "predict" some details about the lowered op sequence and makes no sense in isolation).

Since ComprehensiveBufferization is already doing an explicit enumeration to handle the lowering code, it seems natural that it would have an explicit enumeration of this op property which is tightly coupled to the rewrites.

Or to put it another way, either:

  1. Have a single interface that provides both the getTiedOpResult function and the lowering code that converts the tensor op into a sequence of buffer ops, or
  2. Do explicit case enumeration for getTiedOpResult and for the lowering code and delete this interface.

I strongly prefer 2, because 1. feels contrary to the current intent here of having a scoped-down bufferization for the types of code that linalg transformations generates, to avoid needing to do the heavy lift of generalizing the existing bufferization infra to support inplace, which is more involved due to it's grander scope.

Maybe as we see how users are intending to use this, we will find that some interface is needed for pluggability (e.g. maybe IREE needs some pluggability for a few ops at the ABI boundaries) -- I would rather wait and see what those concrete needs are before creating such an interface (which should be clearly labeled as specific to this pass -- something like ComprehensiveBufferizeOpInterface).

mlir/lib/Dialect/Linalg/Transforms/ComprehensiveBufferization.cpp
1 ↗(On Diff #344894)

nit: can we call this ComprehensiveBufferize.cpp instead of ComprehensiveBufferization.cpp, to keep the term "Bufferize" consistent?

464 ↗(On Diff #344894)

TODO is out of date.

mlir/include/mlir/Interfaces/InplaceInterface.td
21 ↗(On Diff #344894)

Basically the information needed is that OpOperand and OpResult are the same tensor type and size.
This information is unrelated to any transformation and is purely a property of the op.
In the static case, this is trivial and nothing special is required.
In the dynamic case, this is a property of the op and each op spells this differently.

I can change the naming of the interface and the doc to reflect the property above and drop any mention of bufferization.
Would that alleviate your concerns?

Note that this is just a minimal subset of the interface that you have been using on the IREE side: https://github.com/google/iree/blob/22a905c485e3ae2ff9743dc9d195de3b90b08ed5/iree/compiler/Dialect/IREE/IR/IREEInterfaces.td#L24

nicolasvasilache marked 2 inline comments as done.

Address comments.
Simplify inplace patterns.
Fix read interference bug for func args.
Add negative tests.

nicolasvasilache marked an inline comment as done.May 13 2021, 12:51 PM
nicolasvasilache added inline comments.
mlir/include/mlir/Interfaces/InplaceInterface.td
21 ↗(On Diff #344894)

FYI, after discussing with @silvas we settled on reverting back to the pass-local switch for now.

Address comments.

aartbik added inline comments.May 13 2021, 1:36 PM
mlir/lib/Transforms/ComprehensiveBufferization.cpp
155 ↗(On Diff #342159)

Note that this is very similar to the "fastOutput" experimental flag I have been using in the sparse compiler to get "in place" bufferization semantics. I am not sure how we can make the semantics cleaner, but I there is clearly a need for something like this!

silvas accepted this revision.May 13 2021, 2:04 PM

Nicolas and I had a long conversation regarding the "in place op interface". We managed to tease out two separate concepts that are independent.

  1. Whether a tensor result is known to have the same shape, element type, and encoding as an input. This is a property that can be described at the tensor level. A motivating example here is a 10-ary elementwise add op -- all ten inputs have the same shape, element type, and encoding as the input. This type of tensor-level compatibility is necessary to allow a bufferization pass to be "smart" and reuse the storage of one operand for the result, but it is not sufficient for a bufferization pass to safely decide whether an op can be done in place or not.
  1. The ultimate determination of whether it is safe for a tensor op to bufferize to an in place op is strictly a function of the choice of lowering (and cannot be described at the level of a tensor op). E.g. a (data-moving) transpose op could be implemented as for (...) dst[i,j] = src[j,i] in which case it could not be done in place, or it could be implemented with a transpose algorithm that can be done in place (block swapping etc.). This is purely an implementation detail of the lowering that happens during bufferization, and cannot be known independently of the pattern which is used to bufferize the op (you could put such a pattern in an op interface, but that doesn't change the fact that the "inplaceability" is a function of the pattern, not the op). A similar situation occurs with e.g. a square tensor-level matmul D=A*B+C. All of A, B, C, D can have the same shape, element type, and encoding, but whether it is possible to reuse A to hold the storage for D depends on the lowering (and I'm not aware of any matmul lowering that operates in place with LHS/RHS).

Thus, the tensor-level InplaceOpInterface that we had here before was more of an attempt at 1, but since that is not the sufficient condition for the pass, it was deemed a premature generalization to add that (something like that might be useful, or a generalization of that which allows e.g. a "concat" to indicate that the output ).

Indeed, the pass uses an open-coded enumeration of the lowering patterns, so a matching open-coded enumeration for the "inplaceability" information coupled to those lowering patterns makes sense and is consistent with this pass's goals of being fairly isolated and standalone -- we don't want to prematurely add interfaces for something like that until we have a user.

I'm personally ok for this to go in. It's self-contained, and serves as a benchmark for the needed functionality needed if/when we add this kind of behavior to the other bufferization system.

mlir/lib/Dialect/Linalg/Transforms/ComprehensiveBufferize.cpp
132

Document the contract of this: it says which results will be reused inplace by the bufferization patterns in bufferizeFuncOpInternals, so that the analysis can correctly predict the inplaced final result for its safety analysis.

This revision is now accepted and ready to land.May 13 2021, 2:04 PM

Address comment.

This revision was landed with ongoing or failed builds.May 13 2021, 3:25 PM
This revision was automatically updated to reflect the committed changes.
nicolasvasilache marked an inline comment as done.May 13 2021, 3:28 PM