This is an archive of the discontinued LLVM Phabricator instance.

[mlir][bufferization] Improve analysis for element-wise operation
Closed, Public

Authored by springerm on Aug 2 2023, 6:26 AM.

Details

Summary

Before this change, two equivalent operands that bufferize to a memory read and a memory write, respectively, were always considered conflicting. This change improves the analysis for ops that bufferize to element-wise access. Such ops can bufferize in-place because an original element value is no longer needed once the updated value of that element has been computed and written.

This change allows ops such as the following one to bufferize in-place:

%0 = linalg.elemwise_binary {fun = #linalg.binary_fn<add>}
    ins(%a, %b : tensor<5xf32>, tensor<5xf32>)
    outs(%a : tensor<5xf32>) -> tensor<5xf32>
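
For illustration only (a hand-written sketch, not taken from the patch), this is roughly the in-place bufferized form that the improved analysis enables, assuming identity layout maps and function-boundary bufferization; the function name is made up:

func.func @add_inplace(%a: memref<5xf32>, %b: memref<5xf32>) {
  // %a serves as both input and output buffer; no new allocation or copy is needed.
  linalg.elemwise_binary {fun = #linalg.binary_fn<add>}
      ins(%a, %b : memref<5xf32>, memref<5xf32>)
      outs(%a : memref<5xf32>)
  return
}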

Event Timeline

springerm created this revision. Aug 2 2023, 6:26 AM
Herald added a project: Restricted Project. Aug 2 2023, 6:26 AM
springerm requested review of this revision. Aug 2 2023, 6:26 AM
springerm edited the summary of this revision. Aug 3 2023, 2:31 AM
steeve added a subscriber: steeve. (Edited) Aug 3 2023, 3:42 AM

Hi,

Thank you for this quick patch!
I am running into this issue ("wrong allocation"). OP's patch works, but only if both operands have the same type. If one of them is an arith.constant, an allocation still occurs. I believe this is because areEquivalentBufferizedValues fails.

Sample MLIR:

func.func @relu(%a: tensor<5xf32>) -> tensor<5xf32> {
    %c0f = arith.constant 0.0 : f32
    %0 = linalg.elemwise_binary {fun = #linalg.binary_fn<add>}
        ins(%a, %c0f : tensor<5xf32>, f32)
        outs(%a : tensor<5xf32>) -> tensor<5xf32>
    return %0 : tensor<5xf32>
}
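
For reference, a rough sketch (hand-written, not tool output) of the unwanted bufferization that results when the analysis flags a conflict: the op writes into a freshly allocated buffer instead of updating %a in place. Names are illustrative.

func.func @relu(%a: memref<5xf32>) -> memref<5xf32> {
  %c0f = arith.constant 0.0 : f32
  // Out-of-place result: a new buffer is allocated for the output.
  %alloc = memref.alloc() : memref<5xf32>
  linalg.elemwise_binary {fun = #linalg.binary_fn<add>}
      ins(%a, %c0f : memref<5xf32>, f32)
      outs(%alloc : memref<5xf32>)
  return %alloc : memref<5xf32>
}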
springerm updated this revision to Diff 546836. Aug 3 2023, 6:34 AM
springerm edited the summary of this revision.

address comments: better analysis

steeve added a comment. (Edited) Aug 3 2023, 7:08 AM

Thank you!
This now works with the example, but it fails if the tensor is, e.g., 32xf32 instead of 5xf32:

!TC = tensor<32xf32>
func.func @relu(%a: !TC) -> !TC {
    %c0f = arith.constant 0.0 : f32
    %0 = linalg.elemwise_binary {fun = #linalg.binary_fn<add>}
        ins(%a, %c0f : !TC, f32)
        outs(%a : !TC) -> !TC
    return %0 : !TC
}

EDIT: it fails whenever the tensor is larger than 8xf32, i.e. the first failing case is 9xf32, no matter how many dimensions (8x8x8x8 works, 9x9x9x9 doesn't).

> This now works with the example, but fails if the tensor is 32xf32 instead of 5xf32 […]

Hmm, I cannot reproduce this. Did you run it as part of the test case that I added? It bufferizes in-place on my machine:

func.func @relu(%arg0: tensor<32xf32> {bufferization.access = "read-write"}) -> tensor<32xf32> {
  %cst = arith.constant 0.000000e+00 : f32
  %0 = linalg.elemwise_binary {__inplace_operands_attr__ = ["true", "none", "true"], fun = #linalg.binary_fn<add>} ins(%arg0, %cst : tensor<32xf32>, f32) outs(%arg0 : tensor<32xf32>) -> tensor<32xf32>
  return {__equivalent_func_args__ = [0], __inplace_operands_attr__ = ["true"]} %0 : tensor<32xf32>
}
nicolasvasilache accepted this revision. Aug 3 2023, 7:25 AM

Thanks!

mlir/include/mlir/Dialect/Bufferization/IR/BufferizableOpInterface.td
100

nit: in compiler lingo we'd also say "free of loop-carried dependences"

102

typo: hypothetical

140

that the op is not element-wise. -> that the op is element-wise. ?

mlir/lib/Dialect/Linalg/Transforms/BufferizableOpInterfaceImpl.cpp
118

move this TODO to l129 ?

125

Technically this would break if we allowed mixed tensor/buffer (which we did in the past).
Guard against memref types too?

This revision is now accepted and ready to land. Aug 3 2023, 7:25 AM
springerm marked 5 inline comments as done. Aug 3 2023, 7:32 AM
springerm added inline comments.
mlir/include/mlir/Dialect/Bufferization/IR/BufferizableOpInterface.td
140

This is correct as is. Being element-wise enables an additional optimization. If we don't know for sure, we don't optimize.
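
To make the distinction concrete, here is a hypothetical counterexample (not part of this patch): a transpose written as linalg.generic is not element-wise, because element (d0, d1) of the result depends on element (d1, d0) of the input, so letting the result reuse the input buffer could overwrite values that are still needed.

#transposed = affine_map<(d0, d1) -> (d1, d0)>
#identity = affine_map<(d0, d1) -> (d0, d1)>
func.func @transpose_not_elementwise(%a: tensor<5x5xf32>) -> tensor<5x5xf32> {
  // Reads %a at (d1, d0) while writing the init at (d0, d1): not element-wise,
  // so the analysis must not let the result share %a's buffer.
  %0 = linalg.generic {indexing_maps = [#transposed, #identity],
                       iterator_types = ["parallel", "parallel"]}
      ins(%a : tensor<5x5xf32>) outs(%a : tensor<5x5xf32>) {
  ^bb0(%in: f32, %out: f32):
    linalg.yield %in : f32
  } -> tensor<5x5xf32>
  return %0 : tensor<5x5xf32>
}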

springerm marked an inline comment as done. Aug 3 2023, 7:36 AM
This revision was landed with ongoing or failed builds. Aug 3 2023, 7:36 AM
This revision was automatically updated to reflect the committed changes.
mlir/include/mlir/Dialect/Bufferization/IR/BufferizableOpInterface.td
140

ah yes sorry, confused myself over nothing :)

steeve added a comment. Aug 3 2023, 7:52 AM
This comment was removed by steeve.
steeve added a comment. (Edited) Aug 3 2023, 8:00 AM

Here is a full example; the problem arises when the tensor dimension is bigger than the tile size in transform.structured.fuse:

!TT = tensor<32xf32>
func.func @elementwise_no_conflict(%a: !TT,
                                   %b: !TT) -> !TT {
    %c0f = arith.constant 0.0 : f32
    %0 = linalg.elemwise_binary {fun = #linalg.binary_fn<add>}
        ins(%a, %c0f : !TT, f32)
        outs(%a : !TT) -> !TT
  return %0 : !TT
}

transform.sequence failures(propagate) {
^bb0(%module: !transform.any_op):
    %max = transform.structured.match ops{["linalg.elemwise_binary"]} in %module
        : (!transform.any_op) -> !transform.any_op

    %max_tiled, %loops:2 = transform.structured.fuse %max { tile_sizes = [8, 8] } // <----------
        : (!transform.any_op) -> (!transform.any_op, !transform.any_op, !transform.any_op)

    %module_2 = transform.bufferization.one_shot_bufferize
        layout{IdentityLayoutMap} %module
        { bufferize_function_boundaries = true, allow_return_allocs = true }
        : (!transform.any_op) -> !transform.any_op

    transform.yield
}

Run with:

$ mlir-opt test.mlir --test-transform-dialect-interpreter --test-transform-dialect-erase-schedule

> Here is a full example; the problem arises when the tensor dimension is bigger than the tile size in transform.structured.fuse: […]

There's probably an alloc already in some shape or form (tensor.empty, bufferization.alloc_tensor, etc.) before it hits bufferization.
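
For example (an illustrative snippet, not taken from the failing IR), destination tensors such as these typically force a new buffer during bufferization, independently of the element-wise analysis:

// A destination tensor created out of thin air has no existing buffer to reuse.
%empty = tensor.empty() : tensor<8xf32>
// This op explicitly requests a new allocation when bufferized.
%alloc = bufferization.alloc_tensor() : tensor<8xf32>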