This is an archive of the discontinued LLVM Phabricator instance.

[mlir][bufferize] Partly support memrefs with non-standard layout in `finalizing-bufferize`
ClosedPublic

Authored by springerm on Feb 16 2022, 5:28 AM.

Details

Summary

This change adds support in finalizing-bufferize for IR such as:

%0 = bufferization.to_tensor %m : memref<?xf32, affine_map<(d0)[s0] -> (d0 + s0)>>
%1 = bufferization.to_memref %0 : memref<?xf32>

Depending on the exact source/dest type, this folds to a memref.cast or requires a reallocation + copy.
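
For illustration, the two outcomes could look roughly as follows (a sketch, not the exact output of the pass; the cast-compatible case with a static shape is a hypothetical example):

// Cast-compatible types (e.g., static shape -> dynamic shape): folds to a cast.
%1 = memref.cast %m : memref<5xf32> to memref<?xf32>

// Layout that cannot be cast away (as in the example above): reallocate + copy.
%c0 = arith.constant 0 : index
%dim = memref.dim %m, %c0 : memref<?xf32, affine_map<(d0)[s0] -> (d0 + s0)>>
%alloc = memref.alloc(%dim) : memref<?xf32>
memref.copy %m, %alloc : memref<?xf32, affine_map<(d0)[s0] -> (d0 + s0)>> to memref<?xf32>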

Memref types with non-standard layouts are required to support non-copying tensor.extract_slice ops.

Note: Dest memref types with non-standard layouts are not yet supported.

The goal of this commit is to ease the transition towards One-Shot Bufferization. This change allows us to remove the existing partial, copying bufferization pattern for tensor.extract_slice and replace it with the non-copying, BufferizableOpInterface-based variant.

Depends On D119937

Diff Detail

Event Timeline

springerm created this revision. Feb 16 2022, 5:28 AM
springerm requested review of this revision. Feb 16 2022, 5:28 AM
springerm added inline comments.
mlir/lib/Transforms/Utils/DialectConversion.cpp
2591–2595 (On Diff #409217)

@ftynse is preparing a standalone revision for this part.

silvas accepted this revision. Feb 17 2022, 12:01 PM
This revision is now accepted and ready to land. Feb 17 2022, 12:01 PM
mamrami added a subscriber: mamrami. Feb 1 2023, 7:10 AM
mamrami added inline comments.
mlir/lib/Dialect/Bufferization/IR/BufferizationOps.cpp
28

It's a bit delayed, but I am using One-Shot Bufferize now and I have a question regarding this line.
Why do we want the memory spaces to match?
In test/Dialect/Tensor/one-shot-bufferize.mlir:268 we have an example of a memref.copy where src and dst have different memory spaces.
I tried to remove the check in line 31, but the canonicalization test fails because the canonicalizer uses this method too.
Is there a difference between the canonicalizer and One-Shot approaches?

springerm added inline comments. Feb 1 2023, 7:24 AM
mlir/lib/Dialect/Bufferization/IR/BufferizationOps.cpp
28

This function is used for folding to_tensor-to_memref pairs. If the memory space is different, we can't just fold. We could realloc and copy, but that could be expensive and potentially cause memory leaks, because we don't have a good story for buffer deallocation at the moment. (Only buffers that are allocated by One-Shot Bufferize get deallocated.)
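
For example (a hypothetical pair; %m is assumed to live in the default memory space):

%0 = bufferization.to_tensor %m : memref<16xf32>
// Different memory space (1): this pair cannot simply fold to %m.
%1 = bufferization.to_memref %0 : memref<16xf32, 1>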

Note that when you use One-Shot Bufferize to do a full-function bufferization, you should not run into this issue. We model all memory space copies explicitly via bufferization.alloc_tensor, which has a memory space attribute. At the end, all to_tensor/to_memref ops should fold away. If they don't, it would be interesting to see why that is the case.

mamrami added inline comments. Feb 1 2023, 8:27 AM
mlir/lib/Dialect/Bufferization/IR/BufferizationOps.cpp
28

I've been looking at it for quite a while today, so I have some insights.
First, here's a reproducer:

mlir-opt -allow-unregistered-dialect -one-shot-bufferize="bufferize-function-boundaries=1 allow-unknown-ops=1 allow-return-allocs=1 function-boundary-type-conversion=identity-layout-map copy-before-write=1"

Input:

module {
  func.func @foo(%arg0: tensor<16xsi8>) -> tensor<16xsi8> {
    %0 = "my.op"(%arg0) : (tensor<16xsi8>) -> tensor<16xsi8>
    return %0 : tensor<16xsi8>
  }
  func.func @bar(%arg0: memref<16xsi8>) -> memref<16xsi8, 1> {
    %0 = bufferization.to_tensor %arg0 : memref<16xsi8>
    %1 = call @foo(%0) : (tensor<16xsi8>) -> tensor<16xsi8>
    %2 = bufferization.to_memref %1 : memref<16xsi8, 1>
    return %2 : memref<16xsi8, 1>
  }
}

Output:

module {
  func.func @foo(%arg0: memref<16xsi8>) -> memref<16xsi8> {
    %0 = bufferization.to_tensor %arg0 : memref<16xsi8>
    %1 = "my.op"(%0) : (tensor<16xsi8>) -> tensor<16xsi8>
    %2 = bufferization.to_memref %1 : memref<16xsi8>
    return %2 : memref<16xsi8>
  }
  func.func @bar(%arg0: memref<16xsi8>) -> memref<16xsi8, 1> {
    %alloc = memref.alloc() {alignment = 64 : i64} : memref<16xsi8>
    memref.copy %arg0, %alloc : memref<16xsi8> to memref<16xsi8>
    %0 = call @foo(%alloc) : (memref<16xsi8>) -> memref<16xsi8>
    %1 = bufferization.to_tensor %0 : memref<16xsi8>
    %2 = bufferization.to_memref %1 : memref<16xsi8, 1>
    return %2 : memref<16xsi8, 1>
  }
}

You mentioned full-function bufferization, but here @bar has a to_memref, so maybe that is what causes this behaviour.
I noticed that the to_tensor was inserted when bufferizing the CallOp, but it is not inserted into the worklist. So it stays a to_tensor and does not bufferize to alloc_tensor and later to alloc.
The downside of adding the to_tensor to the worklist is that, for the same memory space, the to_tensor-to_memref pair would fold.
Perhaps it can be added to the worklist if it is still there after foldToMemrefToTensorPair.

springerm added inline comments.

Note that to_tensor and to_memref are not supposed to bufferize; they just fold away. Typically we have input IR such as:

func.func @bar(%arg0: memref<16xsi8>) -> memref<16xsi8, 1> {
  %0 = bufferization.to_tensor %arg0 : memref<16xsi8>
  %1 = call @foo(%0) : (tensor<16xsi8>) -> tensor<16xsi8>
  %realloc = bufferization.alloc_tensor() copy(%1) { memory_space = 1 } : tensor<16xsi8>
  %2 = bufferization.to_memref %realloc : memref<16xsi8, 1>
  return %2 : memref<16xsi8, 1>
}

Now it will just fold after bufferization.
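
So after bufferization, the result should be roughly the following (a sketch; the exact allocation attributes may differ):

func.func @bar(%arg0: memref<16xsi8>) -> memref<16xsi8, 1> {
  %0 = call @foo(%arg0) : (memref<16xsi8>) -> memref<16xsi8>
  // The alloc_tensor with memory_space = 1 becomes an alloc in space 1 plus a
  // copy; the surrounding to_tensor/to_memref pairs fold away.
  %alloc = memref.alloc() : memref<16xsi8, 1>
  memref.copy %0, %alloc : memref<16xsi8> to memref<16xsi8, 1>
  return %alloc : memref<16xsi8, 1>
}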

Otherwise, the best we can do is realloc with memory space 1 + copying when folding the to_memref-to_tensor pair.

mamrami added inline comments.

Your example will work for me; I'll just have to add a pass where I insert this alloc_tensor.
Actually, I expected this to happen naturally in the insertTensorCopies stage.

> Otherwise, the best we can do is realloc with memory space 1 + copying when folding the to_memref-to_tensor pair.

In order to do that, I had to delete this check:

if (srcType.getMemorySpaceAsInt() != destType.getMemorySpaceAsInt())
  return failure();

The rest of the logic is already written, and it worked for me. I wanted to upload a patch, but the canonicalizer test failed because it calls the same castOrReallocMemRefValue.
That's why I started this thread :)

Do you think it's OK to allow the realloc when we come from One-Shot Bufferize?
If so, we need to decide whether it is OK to allow it as part of the canonicalizer as well.

Do you think the alloc_tensor should be created at the insertTensorCopies stage?