diff --git a/mlir/docs/Bufferization.md b/mlir/docs/Bufferization.md
new file mode 100644
--- /dev/null
+++ b/mlir/docs/Bufferization.md
@@ -0,0 +1,1168 @@
+# Bufferization on MLIR
+
+The general task of bufferization is to move SSA values (such as tensors) into
+allocated memory buffers, which have to be freed once they are no longer
+needed. This also involves placing copies that clone the contents of allocated
+buffers at specific locations (similar to register allocation). On the one
+hand, these copies are needed to guarantee the correct behavior of a program;
+on the other hand, introducing several aliases for a buffer can lead to
+incorrect buffer deallocation. Therefore, aliases and the program structure
+have to be taken into account, and the introduction of copies resolves these
+conflicts. Unnecessary copies introduced during this process can be eliminated
+afterwards.
+
+```mlir
+func @func_on_tensors(%source: tensor<4xf32>) -> tensor<4xf32> {
+  %0 = mhlo.exp %source : (tensor<4xf32>) -> (tensor<4xf32>)
+  return %0 : tensor<4xf32>
+}
+```
+
+Will be transformed to:
+
+```mlir
+func @func_on_buffers(%source: memref<4xf32>, %target: memref<4xf32>) {
+  %0 = alloc() : memref<4xf32>
+  lmhlo.exp %source, %0 : (memref<4xf32>, memref<4xf32>) -> ()
+  lmhlo.copy %0, %target : (memref<4xf32>, memref<4xf32>) -> ()
+  dealloc %0 : memref<4xf32>
+  return
+}
+```
+
+In general, bufferization is split into three separate phases: a preparation
+phase, a buffer deallocation phase, and a post-processing phase. The buffer
+assignment happens during dialect conversion and allocates a buffer for each
+value that should be moved into memory. This has to be implemented by each
+dialect using the following tools and patterns; in particular, all operations
+have to be changed to work on `memref` types (see Preparation Phase).
+Afterwards, the placement transformation (see BufferDeallocation) automatically
+adds all required deallocation operations, temporary buffers, and copy
+operations, and the post-processing phase (see CopyRemoval) removes unnecessary
+copies and allocations afterwards.
+
+## Preparation Phase
+
+In order to apply the BufferDeallocation code transformation, the input MLIR
+program needs to leverage allocations (buffers in general) and `memref` types
+(as outlined above). If your input program does not work on buffers, you need
+to perform this preparation step in order to port it to the “buffer world”.
+This is a user-defined preparation step that is intended to be applied during
+dialect conversion. The user has to take care of the right conversion by
+providing conversion patterns that rely on a type converter to assign buffers.
+
+A necessary step is to apply type and function signature conversion.
+Furthermore, after changing all function signatures, the associated return and
+call operations must comply with the new corresponding function signatures. For
+this purpose, three operation conversion patterns are introduced:
+
+* BufferizeFuncOpConverter
+* BufferizeReturnOpConverter
+* BufferizeCallOpConverter
+
+In order to use these conversion patterns, the user needs to define a custom
+BufferizeTypeConverter implementation.
+
+### BufferizeTypeConverter
+
+The BufferizeTypeConverter is an extension to the TypeConverter class that
+provides additional functionality for dialect developers to decide on the
+signature of a function. The extra functionality includes:
+
+* `setResultConversionKind` to decide whether a function result that is
+converted to a specific type should be appended to the function argument list
+as an output argument or remain a function result.
+* the ability to register custom callbacks for type and value unpacking
+(decomposition).
+
+### ResultConversionKind
+
+ResultConversionKind is an enum with two values:
+
+* AppendToArgumentsList
+* KeepAsFunctionResult
+
+It defines how BufferizeFuncOpConverter should handle function results while
+converting the signature of the function. The other two operation conversion
+patterns also use the ResultConversionKind to adapt themselves to the new
+function signature.
+
+`ResultConversionKind` can be set using `setResultConversionKind`, which needs
+two template parameters that correspond to the types before and after type
+conversion. This mapping specifies whether the resulting type should stay as a
+function result or should be appended to the argument list after the
+conversion is done. Note that the default value for unspecified mappings is
+`KeepAsFunctionResult`. For instance, the following call configures the
+`BufferizeTypeConverter` instance such that all MemRefType function results
+that stem from converted `RankedTensorType`s are appended to the function
+argument list by BufferizeFuncOpConverter:
+
+```c++
+converter.setResultConversionKind<RankedTensorType, MemRefType>(
+    BufferizeTypeConverter::AppendToArgumentsList);
+```
+
+### Callbacks for Unpacking Types
+
+```mlir
+func @func_on_tensors(%arg0: tuple<i1,f32>) -> (tuple<tensor<10xf32>, tensor<5xf32>>) {
+  ...
+}
+```
+
+Will be transformed to:
+
+```mlir
+func @func_on_buffers(%arg0: i1, %arg1: f32, %target0: memref<10xf32>, %target1: memref<5xf32>) {
+  ...
+}
+```
+
+BufferizeFuncOpConverter is also able to unpack the types of arguments and
+results of a function during function signature conversion. In the example
+above, it unwraps the tuple types and converts the type of each constituent.
+
+For this purpose, users can provide custom callbacks by using
+`addDecomposeTypeConversion` for the `BufferizeTypeConverter` instance to show
+how a specific type (e.g. TupleType) can be unpacked. However, when a type
+decomposition is provided, two additional callbacks have to be defined as
+well. They specify how to pack or unpack values of that particular type. These
+two callbacks can be provided by the `addArgumentMaterialization` (packing)
+**and** `addDecomposeValueConversion` (unpacking) functions.
+
+The following piece of code demonstrates this functionality by flattening a
+`TupleType`.
+
+```c++
+converter.addDecomposeTypeConversion(
+    [](TupleType tupleType, SmallVectorImpl<Type> &types) {
+      tupleType.getFlattenedTypes(types);
+      return success();
+    });
+
+converter.addArgumentMaterialization(
+    [](OpBuilder &builder, TupleType resultType, ValueRange inputs,
+       Location loc) -> Optional<Value> {
+      if (inputs.size() == 1)
+        return llvm::None;
+      TypeRange typeRange = inputs.getTypes();
+      SmallVector<Type, 2> types(typeRange.begin(), typeRange.end());
+      TupleType tuple = TupleType::get(types, builder.getContext());
+      mlir::Value value = builder.create<MakeTupleOp>(loc, tuple, inputs);
+      return value;
+    });
+
+converter.addDecomposeValueConversion([](OpBuilder &builder, Location loc,
+                                         TupleType resultType, Value value,
+                                         SmallVectorImpl<Value> &values) {
+  for (unsigned i = 0, e = resultType.size(); i < e; ++i) {
+    Value res = builder.create<GetTupleElementOp>(
+        loc, resultType.getType(i), value, builder.getI32IntegerAttr(i));
+    values.push_back(res);
+  }
+  return success();
+});
+```
+
+In the scope of these callback functions, the elements of a tuple value can be
+decomposed using `GetTupleElementOp`. Conversely, `MakeTupleOp` is used to pack
+a list of values into a single tuple value.
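+
+To show how these pieces fit together, the following sketch wires the
+converter configuration from above into a dialect-specific preparation pass
+that runs during dialect conversion. This is only an illustration: the pass
+name `MyDialectBufferizePass`, the chosen legality rules, and the exact
+signature of `populateWithBufferizeOpConversionPatterns` (described in the
+next section) are assumptions and may differ from the actual API.
+
+```c++
+// Sketch of a user-defined preparation pass; all names marked as hypothetical
+// are assumptions for illustration purposes only.
+struct MyDialectBufferizePass // hypothetical pass name
+    : public PassWrapper<MyDialectBufferizePass, OperationPass<ModuleOp>> {
+  void runOnOperation() override {
+    MLIRContext &context = getContext();
+    ConversionTarget target(context);
+    BufferizeTypeConverter converter;
+
+    // Type, result-conversion and decomposition rules as shown above.
+    converter.addConversion([](RankedTensorType type) {
+      return (Type)MemRefType::get(type.getShape(), type.getElementType());
+    });
+    converter.setResultConversionKind<RankedTensorType, MemRefType>(
+        BufferizeTypeConverter::AppendToArgumentsList);
+    converter.addDecomposeTypeConversion(
+        [](TupleType tupleType, SmallVectorImpl<Type> &types) {
+          tupleType.getFlattenedTypes(types);
+          return success();
+        });
+
+    // Function/return/call conversion patterns (see the next section); the
+    // template arguments follow the BufferizeReturnOpConverter example.
+    OwningRewritePatternList patterns;
+    populateWithBufferizeOpConversionPatterns<mlir::ReturnOp, mlir::ReturnOp,
+                                              linalg::CopyOp>(
+        &context, converter, patterns);
+    // Dialect-specific patterns that rewrite tensor-based operations to their
+    // buffer-based counterparts would be added here as well.
+
+    // Minimal legality setup; real passes need dialect-specific rules.
+    target.addLegalDialect<linalg::LinalgDialect, StandardOpsDialect>();
+    target.addDynamicallyLegalOp<FuncOp>([&](FuncOp op) {
+      return converter.isSignatureLegal(op.getType());
+    });
+
+    if (failed(applyFullConversion(getOperation(), target,
+                                   std::move(patterns))))
+      signalPassFailure();
+  }
+};
+```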
+
+### Bufferization Operation Conversion Patterns
+
+The following conversion patterns can be used to conveniently transform the
+signature of a function as well as its return and call operations:
+
+* `BufferizeFuncOpConverter`
+* `BufferizeReturnOpConverter`
+* `BufferizeCallOpConverter`
+
+Any combination of these conversion patterns can be specified by the user. If
+you need to apply all of these operation converters, you can use
+`populateWithBufferizeOpConversionPatterns`, which sets up all converters.
+
+### BufferizeFuncOpConverter
+
+The BufferizeFuncOpConverter is the actual function operation converter that
+applies signature conversion by using a previously defined
+`BufferizeTypeConverter`.
+
+In the following example, we configure a `BufferizeTypeConverter` instance such
+that
+
+* all RankedTensorTypes should be converted to MemRefTypes.
+* all function results that are results of type conversion from
+RankedTensorTypes to MemRefTypes should be appended to the function argument
+list.
+* all TupleTypes should be flattened and decomposed into their constituents.
+
+```c++
+converter.addConversion([](RankedTensorType type) {
+  return (Type)MemRefType::get(type.getShape(), type.getElementType());
+});
+converter.setResultConversionKind<RankedTensorType, MemRefType>(
+    BufferizeTypeConverter::AppendToArgumentsList);
+converter.addDecomposeTypeConversion(
+    [](TupleType tupleType, SmallVectorImpl<Type> &types) {
+      tupleType.getFlattenedTypes(types);
+      return success();
+    });
+```
+
+Consider the following signature conversion:
+
+```mlir
+func @on_tensors(%arg0: tuple<i1,f32>) -> (tuple<memref<10xf32>, tensor<5xf32>>) {
+  ...
+}
+```
+
+Will be transformed to:
+
+```mlir
+func @on_buffers(%arg0: i1, %arg1: f32, %out: memref<5xf32>) -> memref<10xf32> {
+  ...
+}
+```
+
+Using the presented converter setup, all TupleType arguments and results are
+decomposed first. The tensor<5xf32> result is converted to a memref<5xf32> type
+and appended to the argument list. There is no conversion for the types
+memref<10xf32>, i1, and f32. Therefore, the memref<10xf32> result is kept as it
+is and will also be kept as a function result since there is no
+ResultConversionKind mapping from a MemRefType to a MemRefType. However, if we
+change the result-conversion behavior via
+
+```c++
+converter.setResultConversionKind<RankedTensorType, MemRefType>(
+    BufferizeTypeConverter::KeepAsFunctionResult);
+```
+
+the output will be:
+
+```mlir
+func @on_buffers(%arg0: i1, %arg1: f32) -> (memref<10xf32>, memref<5xf32>) {
+  ...
+}
+```
+
+### BufferizeReturnOpConverter
+
+When changing the signature of a function, the return operands must match the
+results of the corresponding function if buffer-typed results have been
+configured to be appended to the function argument list. This matching
+consists of two separate steps. First, we have to remove the operands that have
+been appended to the argument list as output arguments. Second, we have to
+introduce a copy for each removed operand that writes its content to the
+corresponding output argument. However, since each dialect has its own
+dialect-dependent return and copy operations, this conversion pattern comes
+with three template parameters which are the original return operation, the
+target return operation, and the copy operation kind.
+
+In the following example, two conversion patterns are inserted into the pattern
+list. The `BufferizeReturnOpConverter` is set to replace a standard return
+operation with the same operation type and to use `linalg.copy` for the
+introduced copies.
+
+```c++
+patterns->insert<
+    BufferizeFuncOpConverter,
+    BufferizeReturnOpConverter<mlir::ReturnOp, mlir::ReturnOp, linalg::CopyOp>
+>(...)
+``` + +Consider the following input/output program using a single return: + +```mlir +func @on_tensors(%arg0: tensor<5xf32>, %arg1: i1) -> (tensor<5xf32>, i1) { + return %arg0, %arg1 : tensor<5xf32>, i1 +} +``` + +Will be transformed to: + +```mlir +func @on_buffers(%arg0: memref<5xf32>, %arg1: i1, %out: memref<5xf32>) -> i1 { + linalg.copy(%arg0, %out) : memref<5xf32>, memref<5xf32> + return %arg1 : i1 +} +``` + +Based on our previously configured `BufferizeTypeConverter` instance which +requires buffer-typed-function-results to be appended to the function argument +list, the new `on_buffers` function signature is created. The operands of the +return operation must be adapted with the new function signature. Therefore, +the buffer-typed operand is removed from the operand list of the new return +operation. Instead, a copy operation is inserted right before the return +operation to copy the content of the operand buffer to the target buffer and +yields the output as shown above. + +### BufferizeCallOpConverter + +The BufferizeCallOpConverter is a call operation converter that transforms and +matches the operands and results of a call operation with the arguments and +results of the callee. Besides converting operand and result types, it +allocates a buffer for each buffer-based result of the called function that is +appended to the argument list (if buffer typed results have been configured to +be appended to the function arguments list). + +The following piece of code shows a sample call site, based on our previously +configured `BufferizeTypeConversion`: + +```mlir +func @callee(%arg0: tensor<5xf32>) -> (tensor<5xf32>) { + return %arg0 : tensor<5xf32> +} + +func @caller(%arg0: tensor<5xf32>) -> tensor<5xf32> { + %x = call @callee(%arg0) : (tensor<5xf32>) -> tensor<5xf32> + return %x : tensor<5xf32> +} +``` + +Will be transformed to: + +```mlir +func @callee(%arg0: memref<5xf32>, %out: memref<5xf32>) { + linalg.copy(%arg0, %out) : memref<5xf32>, memref<5xf32> + return +} + +func @caller(%arg0: memref<5xf32>, %out: memref<5xf32>) { + %0 = alloc() : memref<5xf32> + call @callee(%arg0, %0) : (memref<5xf32>, memref<5xf32>) -> () + linalg.copy(%0, %out) : memref<5xf32>, memref<5xf32> + return +} +``` + +### Summarizing Example + +To summarize all preparation converters, the following sample is a complete +listing of an input IR program and its output after applying all converters: + +```mlir +func @callee(%arg0: tuple,i1>) -> tuple,i1> { + return %arg0 : tuple,i1> +} + +func @caller(%arg0: tuple,i1>) -> tuple,i1> { + %x = call @callee(%arg0) : (tuple,i1>) -> tuple,i1> + return %x : tuple,i1> +} +``` + +Will be transformed to: + +```mlir +func @callee(%arg0: memref<5xf32>, %arg1: i1, %arg2: memref<5xf32>) -> i1 { + %0 = "test.make_tuple"(%arg0, %arg1) : (memref<5xf32>, i1) -> tuple, i1> + %1 = "test.get_tuple_element"(%0) {index = 0 : i32} : (tuple, i1>) -> memref<5xf32> + %2 = "test.get_tuple_element"(%0) {index = 1 : i32} : (tuple, i1>) -> i1 + linalg.copy(%1, %arg2) : memref<5xf32>, memref<5xf32> + return %2 : i1 +} +func @caller(%arg0: memref<5xf32>, %arg1: i1, %arg2: memref<5xf32>) -> i1 { + %0 = "test.make_tuple"(%arg0, %arg1) : (memref<5xf32>, i1) -> tuple, i1> + %1 = "test.get_tuple_element"(%0) {index = 0 : i32} : (tuple, i1>) -> memref<5xf32> + %2 = "test.get_tuple_element"(%0) {index = 1 : i32} : (tuple, i1>) -> i1 + %3 = alloc() : memref<5xf32> + %4 = call @callee(%1, %2, %3) : (memref<5xf32>, i1, memref<5xf32>) -> i1 + %5 = "test.make_tuple"(%3, %4) : (memref<5xf32>, i1) -> tuple, 
i1> + %6 = "test.get_tuple_element"(%5) {index = 0 : i32} : (tuple, i1>) -> memref<5xf32> + %7 = "test.get_tuple_element"(%5) {index = 1 : i32} : (tuple, i1>) -> i1 + linalg.copy(%6, %arg2) : memref<5xf32>, memref<5xf32> + return %7 : i1 +} +``` + +## Buffer Deallocation - Internal Functionality + +This section covers the internal functionality of the BufferDeallocation +transformation. The transformation consists of several passes. The main pass +called BufferDeallocation can be applied via “-buffer-deallocation” on MLIR +programs. Currently, there are three optimization passes, that moves allocs and +converts AllocOps in AllocaOps, if possible. The first and second pass can be +applied using “-buffer-hoisting” or “-buffer-loop-hoisting”, the third one +using “-promote-buffers-to-stack”. However, these optimizations must be applied +before using the BufferDeallocation pass. This design decision was made to +simplify the maintenance and the development efforts to realize optimizations +reasoning about buffers as it avoids reasoning about the actual buffer +deallocation places and potential temporary copies that have to be introduced. + +### Requirements + +In order to use BufferDeallocation on an arbitrary dialect, several +control-flow interfaces have to be implemented when using custom operations. +This is particularly important to understand the implicit control-flow +dependencies between different parts of the input program. Without implementing +the following interfaces, control-flow relations cannot be discovered properly +and the resulting program can become invalid: + +* Branch-like terminators should implement the `BranchOpInterface` to query and +manipulate associated operands. +* Operations involving structured control flow have to implement the +`RegionBranchOpInterface` to model inter-region control flow. +* Terminators yielding values to their parent operation (in particular in the +scope of nested regions within `RegionBranchOpInterface`-based operations), +should implement the `ReturnLike` trait to represent logical “value returns”. + +Example dialects that are fully compatible are the “std” and “scf” dialects +with respect to all implemented interfaces. This does not include any rewriter +patterns. + +### Detection of Buffer Allocations + +The first step of the BufferDeallocation transformation is to identify +manageable allocations operations that implement the `SideEffects` interface. +Furthermore, these ops need to apply the effect `MemoryEffects::Allocate` to a +particular result value while not using the resource +`SideEffects::AutomaticAllocationScopeResource` (since it is currently reserved +for allocations, like `Alloca` that will be automatically deallocated by a +parent scope). Allocations that have not been detected in this phase will not +be tracked internally, and thus, not deallocated automatically. However, +BufferDeallocation is fully compatible with “hybrid” setups in which tracked +and untracked allocations are mixed: + +```mlir +func @mixedAllocation(%arg0: i1) { + %0 = alloca() : memref<2xf32> // aliases: %2 + %1 = alloc() : memref<2xf32> // aliases: %2 + cond_br %arg0, ^bb1, ^bb2 +^bb1: + use(%0) + br ^bb3(%0 : memref<2xf32>) +^bb2: + use(%1) + br ^bb3(%1 : memref<2xf32>) +^bb3(%2: memref<2xf32>): + ... +} +``` + +Example of using a conditional branch with alloc and alloca. BufferDeallocation +can detect and handle the different allocation types that might be intermixed. 
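+
+For dialects that define their own allocation operations, the detection
+described above only succeeds if the operation advertises its allocation side
+effect. The following minimal sketch shows how a hypothetical `MyAllocOp` (an
+assumed custom operation with a single memref result, not part of this
+document) could implement the standard side-effect hook so that
+BufferDeallocation tracks its result:
+
+```c++
+// Sketch: declaring the allocation effect that BufferDeallocation looks for.
+// MyAllocOp is a hypothetical operation used for illustration only.
+void MyAllocOp::getEffects(
+    SmallVectorImpl<SideEffects::EffectInstance<MemoryEffects::Effect>>
+        &effects) {
+  // Attaching MemoryEffects::Allocate to the result (default resource) makes
+  // the allocation trackable, so a matching deallocation is placed
+  // automatically.
+  effects.emplace_back(MemoryEffects::Allocate::get(), getResult(),
+                       SideEffects::DefaultResource::get());
+  // Using SideEffects::AutomaticAllocationScopeResource here instead (as
+  // stack allocations like alloca do) would exclude the operation from
+  // automatic deallocation.
+}
+```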
+ +Note: the current version does not support allocation operations returning +multiple result buffers. + +### Conversion from AllocOp to AllocaOp + +The PromoteBuffersToStack-pass converts AllocOps to AllocaOps, if possible. In +some cases, it can be useful to use such stack-based buffers instead of +heap-based buffers. The conversion is restricted to several constraints like: + +* Control flow +* Buffer Size +* Dynamic Size + +If a buffer is leaving a block, we are not allowed to convert it into an +alloca. If the size of the buffer is large, we could convert it, but regarding +stack overflow, it makes sense to limit the size of these buffers and only +convert small ones. The size can be set via a pass option. The current default +value is 1KB. Furthermore, we can not convert buffers with dynamic size, since +the dimension is not known a priori. + +### Movement and Placement of Allocations + +Using the buffer hoisting pass, all buffer allocations are moved as far upwards +as possible in order to group them and make upcoming optimizations easier by +limiting the search space. Such a movement is shown in the following graphs. +In addition, we are able to statically free an alloc, if we move it into a +dominator of all of its uses. This simplifies further optimizations (e.g. +buffer fusion) in the future. However, movement of allocations is limited by +external data dependencies (in particular in the case of allocations of +dynamically shaped types). Furthermore, allocations can be moved out of nested +regions, if necessary. In order to move allocations to valid locations with +respect to their uses only, we leverage Liveness information. + +The following code snippets shows a conditional branch before running the +BufferHoisting pass: + +![branch_example_pre_move](/includes/img/branch_example_pre_move.svg) + +```mlir +func @condBranch(%arg0: i1, %arg1: memref<2xf32>, %arg2: memref<2xf32>) { + cond_br %arg0, ^bb1, ^bb2 +^bb1: + br ^bb3(%arg1 : memref<2xf32>) +^bb2: + %0 = alloc() : memref<2xf32> // aliases: %1 + use(%0) + br ^bb3(%0 : memref<2xf32>) +^bb3(%1: memref<2xf32>): // %1 could be %0 or %arg1 + "linalg.copy"(%1, %arg2) : (memref<2xf32>, memref<2xf32>) -> () + return +} +``` + +Applying the BufferHoisting pass on this program results in the following piece +of code: + +![branch_example_post_move](/includes/img/branch_example_post_move.svg) + +```mlir +func @condBranch(%arg0: i1, %arg1: memref<2xf32>, %arg2: memref<2xf32>) { + %0 = alloc() : memref<2xf32> // moved to bb0 + cond_br %arg0, ^bb1, ^bb2 +^bb1: + br ^bb3(%arg1 : memref<2xf32>) +^bb2: + use(%0) + br ^bb3(%0 : memref<2xf32>) +^bb3(%1: memref<2xf32>): + "linalg.copy"(%1, %arg2) : (memref<2xf32>, memref<2xf32>) -> () + return +} +``` + +The alloc is moved from bb2 to the beginning and it is passed as an argument to +bb3. + +The following example demonstrates an allocation using dynamically shaped +types. 
Due to the data dependency of the allocation to %0, we cannot move the +allocation out of bb2 in this case: + +```mlir +func @condBranchDynamicType( + %arg0: i1, + %arg1: memref, + %arg2: memref, + %arg3: index) { + cond_br %arg0, ^bb1, ^bb2(%arg3: index) +^bb1: + br ^bb3(%arg1 : memref) +^bb2(%0: index): + %1 = alloc(%0) : memref // cannot be moved upwards to the data + // dependency to %0 + use(%1) + br ^bb3(%1 : memref) +^bb3(%2: memref): + "linalg.copy"(%2, %arg2) : (memref, memref) -> () + return +} +``` + +### Introduction of Copies + +In order to guarantee that all allocated buffers are freed properly, we have to +pay attention to the control flow and all potential aliases a buffer allocation +can have. Since not all allocations can be safely freed with respect to their +aliases (see the following code snippet), it is often required to introduce +copies to eliminate them. Consider the following example in which the +allocations have already been placed: + +```mlir +func @branch(%arg0: i1) { + %0 = alloc() : memref<2xf32> // aliases: %2 + cond_br %arg0, ^bb1, ^bb2 +^bb1: + %1 = alloc() : memref<2xf32> // resides here for demonstration purposes + // aliases: %2 + br ^bb3(%1 : memref<2xf32>) +^bb2: + use(%0) + br ^bb3(%0 : memref<2xf32>) +^bb3(%2: memref<2xf32>): + … + return +} +``` + +Example of using branches before introducing copies for all allocations. + +The first alloc can be safely freed after the live range of its post-dominator +block (bb3). The alloc in bb1 has an alias %2 in bb3 that also keeps this +buffer alive until the end of bb3. Since we cannot determine the actual +branches that will be taken at runtime, we have to ensure that all buffers are +freed correctly in bb3 regardless of the branches we will take to reach the +exit block. This makes it necessary to introduce a copy for %2, which allows us +to free %alloc0 in bb0 and %alloc1 in bb1. Afterwards, we can continue +processing all aliases of %2 (none in this case) and we can safely free %2 at +the end of the sample program. This sample demonstrates that not all +allocations can be safely freed in their associated post-dominator blocks. +Instead, we have to pay attention to all of their aliases. + +Applying the BufferDeallocation pass to the program above yields the following +result: + +```mlir +func @branch(%arg0: i1) { + %0 = alloc() : memref<2xf32> + cond_br %arg0, ^bb1, ^bb2 +^bb1: + %1 = alloc() : memref<2xf32> + %3 = alloc() : memref<2xf32> // temp copy for %1 + "linalg.copy"(%1, %3) : (memref<2xf32>, memref<2xf32>) -> () + dealloc %1 : memref<2xf32> // %1 can be safely freed here + br ^bb3(%3 : memref<2xf32>) +^bb2: + use(%0) + %4 = alloc() : memref<2xf32> // temp copy for %0 + "linalg.copy"(%0, %4) : (memref<2xf32>, memref<2xf32>) -> () + br ^bb3(%4 : memref<2xf32>) +^bb3(%2: memref<2xf32>): + … + dealloc %2 : memref<2xf32> // free temp buffer %2 + dealloc %0 : memref<2xf32> // %0 can be safely freed here + return +} +``` + +Note that a temporary buffer for %2 was introduced to free all allocations +properly. Note further that the unnecessary allocation of %3 can be easily +removed using one of the post-pass transformations. 
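+
+Conceptually, the rule that drives the copy introduction above is a dominance
+check between an allocation and each of its aliases. The following sketch is a
+simplified illustration of that criterion only, not the actual pass
+implementation; the helper name `isCriticalAlias` is made up for this example:
+
+```c++
+// Conceptual sketch: decide whether an alias of a tracked allocation is
+// "critical" and therefore requires temporary copies in its predecessors.
+bool isCriticalAlias(Value allocResult, Value alias,
+                     DominanceInfo &dominators) {
+  // An alias that is properly dominated by its allocation can simply share
+  // the allocation's lifetime. An alias that is not dominated (such as the
+  // block argument %2 above, which can also receive %0 from bb2) is marked
+  // critical: it receives its own buffer via a temporary copy in every
+  // predecessor block, which in turn allows the source buffers to be freed
+  // early, as shown in the transformed code.
+  return !dominators.properlyDominates(allocResult.getParentBlock(),
+                                       alias.getParentBlock());
+}
+```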
+ +Reconsider the previously introduced sample demonstrating dynamically shaped +types: + +```mlir +func @condBranchDynamicType( + %arg0: i1, + %arg1: memref, + %arg2: memref, + %arg3: index) { + cond_br %arg0, ^bb1, ^bb2(%arg3: index) +^bb1: + br ^bb3(%arg1 : memref) +^bb2(%0: index): + %1 = alloc(%0) : memref // aliases: %2 + use(%1) + br ^bb3(%1 : memref) +^bb3(%2: memref): + "linalg.copy"(%2, %arg2) : (memref, memref) -> () + return +} +``` + +In the presence of DSTs, we have to parameterize the allocations with +additional dimension information of the source buffers, we want to copy from. +BufferDeallocation automatically introduces all required operations to extract +dimension specifications and wires them with the associated allocations: + +```mlir +func @condBranchDynamicType( + %arg0: i1, + %arg1: memref, + %arg2: memref, + %arg3: index) { + cond_br %arg0, ^bb1, ^bb2(%arg3 : index) +^bb1: + %c0 = constant 0 : index + %0 = dim %arg1, %c0 : memref // dimension operation to parameterize + // the following temp allocation + %1 = alloc(%0) : memref + "linalg.copy"(%arg1, %1) : (memref, memref) -> () + br ^bb3(%1 : memref) +^bb2(%2: index): + %3 = alloc(%2) : memref + use(%3) + %c0_0 = constant 0 : index + %4 = dim %3, %c0_0 : memref // dimension operation to parameterize + // the following temp allocation + %5 = alloc(%4) : memref + "linalg.copy"(%3, %5) : (memref, memref) -> () + dealloc %3 : memref // %3 can be safely freed here + br ^bb3(%5 : memref) +^bb3(%6: memref): + "linalg.copy"(%6, %arg2) : (memref, memref) -> () + dealloc %6 : memref // %6 can be safely freed here + return +} +``` + +BufferDeallocation performs a fix-point iteration taking all aliases of all +tracked allocations into account. We initialize the general iteration process +using all tracked allocations and their associated aliases. As soon as we +encounter an alias that is not properly dominated by our allocation, we mark +this alias as _critical_ (needs to be freed and tracked by the internal +fix-point iteration). The following sample demonstrates the presence of +critical and non-critical aliases: + +![nested_branch_example_pre_move](/includes/img/nested_branch_example_pre_move.svg) + +```mlir +func @condBranchDynamicTypeNested( + %arg0: i1, + %arg1: memref, // aliases: %3, %4 + %arg2: memref, + %arg3: index) { + cond_br %arg0, ^bb1, ^bb2(%arg3: index) +^bb1: + br ^bb6(%arg1 : memref) +^bb2(%0: index): + %1 = alloc(%0) : memref // cannot be moved upwards due to the data + // dependency to %0 + // aliases: %2, %3, %4 + use(%1) + cond_br %arg0, ^bb3, ^bb4 +^bb3: + br ^bb5(%1 : memref) +^bb4: + br ^bb5(%1 : memref) +^bb5(%2: memref): // non-crit. alias of %1, since %1 dominates %2 + br ^bb6(%2 : memref) +^bb6(%3: memref): // crit. alias of %arg1 and %2 (in other words %1) + br ^bb7(%3 : memref) +^bb7(%4: memref): // non-crit. 
alias of %3, since %3 dominates %4 + "linalg.copy"(%4, %arg2) : (memref, memref) -> () + return +} +``` + +Applying BufferDeallocation yields the following output: + +![nested_branch_example_post_move](/includes/img/nested_branch_example_post_move.svg) + +```mlir +func @condBranchDynamicTypeNested( + %arg0: i1, + %arg1: memref, + %arg2: memref, + %arg3: index) { + cond_br %arg0, ^bb1, ^bb2(%arg3 : index) +^bb1: + %c0 = constant 0 : index + %d0 = dim %arg1, %c0 : memref + %5 = alloc(%d0) : memref // temp buffer required due to alias %3 + "linalg.copy"(%arg1, %5) : (memref, memref) -> () + br ^bb6(%5 : memref) +^bb2(%0: index): + %1 = alloc(%0) : memref + use(%1) + cond_br %arg0, ^bb3, ^bb4 +^bb3: + br ^bb5(%1 : memref) +^bb4: + br ^bb5(%1 : memref) +^bb5(%2: memref): + %c0_0 = constant 0 : index + %d1 = dim %2, %c0_0 : memref + %6 = alloc(%d1) : memref // temp buffer required due to alias %3 + "linalg.copy"(%1, %6) : (memref, memref) -> () + dealloc %1 : memref + br ^bb6(%6 : memref) +^bb6(%3: memref): + br ^bb7(%3 : memref) +^bb7(%4: memref): + "linalg.copy"(%4, %arg2) : (memref, memref) -> () + dealloc %3 : memref // free %3, since %4 is a non-crit. alias of %3 + return +} +``` + +Since %3 is a critical alias, BufferDeallocation introduces an additional +temporary copy in all predecessor blocks. %3 has an additional (non-critical) +alias %4 that extends the live range until the end of bb7. Therefore, we can +free %3 after its last use, while taking all aliases into account. Note that %4 + does not need to be freed, since we did not introduce a copy for it. + +The actual introduction of buffer copies is done after the fix-point iteration +has been terminated and all critical aliases have been detected. Thereby a +critical alias can be either a block argument or another value that is returned +by an operation. Copies for block arguments are handled by analyzing all +predecessor blocks. This is primarily done by querying the `BranchOpInterface` +of the associated branch terminators that can jump to the current block. +Consider the following example which involves a simple branch and the critical +block argument %2: + +```mlir + custom.br ^bb1(..., %0, : ...) + ... + custom.br ^bb1(..., %1, : ...) + ... +^bb1(%2: memref<2xf32>): + ... +``` + +The `BranchOpInterface` allows us to determine the actual values that will be +passed to block bb1 and its argument %2 by analyzing its predecessor blocks. +Once we have resolved the values %0 and %1 (that are associated with %2 in this +sample), we can introduce a temporary buffer and clone its contents into the +new buffer. Afterwards, we rewire the branch operands to use the newly +allocated buffer instead. However, blocks can have implicitly defined +predecessors by parent ops that implement the `RegionBranchOpInterface`. This +can be the case if this block argument belongs to the entry block of a region. +In this setting, we have to identify all predecessor regions defined by the +parent operation. For every region, we need to get all terminator operations +implementing the `ReturnLike` trait, indicating that they can branch to our +current block. Finally, we can use a similar functionality as described above +to add the temporary copy. This time, we can modify the terminator operands +directly without touching a high-level interface. + +Consider the following inner-region control-flow sample that uses an imaginary +“custom.region_if” operation. It either executes the “then” or “else” region +and always continues to the “join” region. 
Thereby, the +“custom.region_if_yield” operation returns a result to the parent operation. +This sample demonstrates the use of the `RegionBranchOpInterface` to determine +predecessors in order to infer the high-level control flow: + +```mlir +func @inner_region_control_flow( + %arg0 : index, + %arg1 : index) -> memref { + %0 = alloc(%arg0, %arg0) : memref + %1 = custom.region_if %0 : memref -> (memref) + then(%arg2 : memref) { // aliases: %arg4, %1 + custom.region_if_yield %arg2 : memref + } else(%arg3 : memref) { // aliases: %arg4, %1 + custom.region_if_yield %arg3 : memref + } join(%arg4 : memref) { // aliases: %1 + custom.region_if_yield %arg4 : memref + } + return %1 : memref +} +``` + +![region_branch_example_pre_move](/includes/img/region_branch_example_pre_move.svg) + +Non-block arguments (other values) can become aliases when they are returned by +dialect-specific operations. BufferDeallocation supports this behavior via the +`RegionBranchOpInterface`. Consider the following example that uses an “scf.if” +operation to determine the value of %2 at runtime which creates an alias: + +```mlir +func @nested_region_control_flow(%arg0 : index, %arg1 : index) -> memref { + %0 = cmpi "eq", %arg0, %arg1 : index + %1 = alloc(%arg0, %arg0) : memref + %2 = scf.if %0 -> (memref) { + scf.yield %1 : memref // %2 will be an alias of %1 + } else { + %3 = alloc(%arg0, %arg1) : memref // nested allocation in a div. + // branch + use(%3) + scf.yield %1 : memref // %2 will be an alias of %1 + } + return %2 : memref +} +``` + +In this example, a dealloc is inserted to release the buffer within the else +block since it cannot be accessed by the remainder of the program. Accessing +the `RegionBranchOpInterface`, allows us to infer that %2 is a non-critical +alias of %1 which does not need to be tracked. + +```mlir +func @nested_region_control_flow(%arg0: index, %arg1: index) -> memref { + %0 = cmpi "eq", %arg0, %arg1 : index + %1 = alloc(%arg0, %arg0) : memref + %2 = scf.if %0 -> (memref) { + scf.yield %1 : memref + } else { + %3 = alloc(%arg0, %arg1) : memref + use(%3) + dealloc %3 : memref // %3 can be safely freed here + scf.yield %1 : memref + } + return %2 : memref +} +``` + +Analogous to the previous case, we have to detect all terminator operations in +all attached regions of “scf.if” that provides a value to its parent operation +(in this sample via scf.yield). Querying the `RegionBranchOpInterface` allows +us to determine the regions that “return” a result to their parent operation. +Like before, we have to update all `ReturnLike` terminators as described above. +Reconsider a slightly adapted version of the “custom.region_if” example from +above that uses a nested allocation: + +```mlir +func @inner_region_control_flow_div( + %arg0 : index, + %arg1 : index) -> memref { + %0 = alloc(%arg0, %arg0) : memref + %1 = custom.region_if %0 : memref -> (memref) + then(%arg2 : memref) { // aliases: %arg4, %1 + custom.region_if_yield %arg2 : memref + } else(%arg3 : memref) { + %2 = alloc(%arg0, %arg1) : memref // aliases: %arg4, %1 + custom.region_if_yield %2 : memref + } join(%arg4 : memref) { // aliases: %1 + custom.region_if_yield %arg4 : memref + } + return %1 : memref +} +``` + +Since the allocation %2 happens in a divergent branch and cannot be safely +deallocated in a post-dominator, %arg4 will be considered a critical alias. +Furthermore, %arg4 is returned to its parent operation and has an alias %1. 
+This causes BufferDeallocation to introduce additional copies: + +```mlir +func @inner_region_control_flow_div( + %arg0 : index, + %arg1 : index) -> memref { + %0 = alloc(%arg0, %arg0) : memref + %1 = custom.region_if %0 : memref -> (memref) + then(%arg2 : memref) { + %c0 = constant 0 : index // determine dimension extents for temp allocation + %2 = dim %arg2, %c0 : memref + %c1 = constant 1 : index + %3 = dim %arg2, %c1 : memref + %4 = alloc(%2, %3) : memref // temp buffer required due to critic. + // alias %arg4 + linalg.copy(%arg2, %4) : memref, memref + custom.region_if_yield %4 : memref + } else(%arg3 : memref) { + %2 = alloc(%arg0, %arg1) : memref + %c0 = constant 0 : index // determine dimension extents for temp allocation + %3 = dim %2, %c0 : memref + %c1 = constant 1 : index + %4 = dim %2, %c1 : memref + %5 = alloc(%3, %4) : memref // temp buffer required due to critic. + // alias %arg4 + linalg.copy(%2, %5) : memref, memref + dealloc %2 : memref + custom.region_if_yield %5 : memref + } join(%arg4: memref) { + %c0 = constant 0 : index // determine dimension extents for temp allocation + %2 = dim %arg4, %c0 : memref + %c1 = constant 1 : index + %3 = dim %arg4, %c1 : memref + %4 = alloc(%2, %3) : memref // this allocation will be removed by + // applying the copy removal pass + linalg.copy(%arg4, %4) : memref, memref + dealloc %arg4 : memref + custom.region_if_yield %4 : memref + } + dealloc %0 : memref // %0 can be safely freed here + return %1 : memref +} +``` + +### Placement of Deallocs + +After introducing allocs and copies, deallocs have to be placed to free +allocated memory and avoid memory leaks. The deallocation needs to take place +after the last use of the given value. The position can be determined by +calculating the common post-dominator of all values using their remaining +non-critical aliases. A special-case is the presence of back edges: since such +edges can cause memory leaks when a newly allocated buffer flows back to +another part of the program. In these cases, we need to free the associated +buffer instances from the previous iteration by inserting additional deallocs. + +Consider the following “scf.for” use case containing a nested structured +control-flow if: + +```mlir +func @loop_nested_if( + %lb: index, + %ub: index, + %step: index, + %buf: memref<2xf32>, + %res: memref<2xf32>) { + %0 = scf.for %i = %lb to %ub step %step + iter_args(%iterBuf = %buf) -> memref<2xf32> { + %1 = cmpi "eq", %i, %ub : index + %2 = scf.if %1 -> (memref<2xf32>) { + %3 = alloc() : memref<2xf32> // makes %2 a critical alias due to a + // divergent allocation + use(%3) + scf.yield %3 : memref<2xf32> + } else { + scf.yield %iterBuf : memref<2xf32> + } + scf.yield %2 : memref<2xf32> + } + "linalg.copy"(%0, %res) : (memref<2xf32>, memref<2xf32>) -> () + return +} +``` + +In this example, the _then_ branch of the nested “scf.if” operation returns a +newly allocated buffer. + +Since this allocation happens in the scope of a divergent branch, %2 becomes a +critical alias that needs to be handled. As before, we have to insert +additional copies to eliminate this alias using copies of %3 and %iterBuf. This +guarantees that %2 will be a newly allocated buffer that is returned in each +iteration. However, “returning” %2 to its alias %iterBuf turns %iterBuf into a +critical alias as well. In other words, we have to create a copy of %2 to pass +it to %iterBuf. 
Since this jump represents a back edge, and %2 will always be a +new buffer, we have to free the buffer from the previous iteration to avoid +memory leaks: + +```mlir +func @loop_nested_if( + %lb: index, + %ub: index, + %step: index, + %buf: memref<2xf32>, + %res: memref<2xf32>) { + %4 = alloc() : memref<2xf32> + "linalg.copy"(%buf, %4) : (memref<2xf32>, memref<2xf32>) -> () + %0 = scf.for %i = %lb to %ub step %step + iter_args(%iterBuf = %4) -> memref<2xf32> { + %1 = cmpi "eq", %i, %ub : index + %2 = scf.if %1 -> (memref<2xf32>) { + %3 = alloc() : memref<2xf32> // makes %2 a critical alias + use(%3) + %5 = alloc() : memref<2xf32> // temp copy due to crit. alias %2 + "linalg.copy"(%3, %5) : memref<2xf32>, memref<2xf32> + dealloc %3 : memref<2xf32> + scf.yield %5 : memref<2xf32> + } else { + %6 = alloc() : memref<2xf32> // temp copy due to crit. alias %2 + "linalg.copy"(%iterBuf, %6) : memref<2xf32>, memref<2xf32> + scf.yield %6 : memref<2xf32> + } + %7 = alloc() : memref<2xf32> // temp copy due to crit. alias %iterBuf + "linalg.copy"(%2, %7) : memref<2xf32>, memref<2xf32> + dealloc %2 : memref<2xf32> + dealloc %iterBuf : memref<2xf32> // free backedge iteration variable + scf.yield %7 : memref<2xf32> + } + "linalg.copy"(%0, %res) : (memref<2xf32>, memref<2xf32>) -> () + dealloc %0 : memref<2xf32> // free temp copy %0 + return +} +``` + +Example for loop-like control flow. The CFG contains back edges that have to be +handled to avoid memory leaks. The bufferization is able to free the backedge +iteration variable %iterBuf. + +### Private Analyses Implementations + +The BufferDeallocation transformation relies on one primary control-flow +analysis: BufferPlacementAliasAnalysis. Furthermore, we also use dominance and +liveness to place and move nodes. The liveness analysis determines the live +range of a given value. Within this range, a value is alive and can or will be +used in the course of the program. After this range, the value is dead and can +be discarded - in our case, the buffer can be freed. To place the allocs, we +need to know from which position a value will be alive. The allocs have to be +placed in front of this position. However, the most important analysis is the +alias analysis that is needed to introduce copies and to place all +deallocations. + +## Post Phase + +In order to limit the complexity of the BufferDeallocation transformation, some +tiny code-polishing/optimization transformations are not applied on-the-fly +during placement. Currently, there is the CopyRemoval transformation only to +remove unnecessary copy and allocation operations. + +Note: further transformations might be added to the post-pass phase in the +future. + +### CopyRemoval Pass + +A common pattern that arises during placement is the introduction of +unnecessary temporary copies that are used instead of the original source +buffer. For this reason, there is a post-pass transformation that removes these +allocations and copies via `-copy-removal`. This pass, besides removing +unnecessary copy operations, will also remove the dead allocations and their +corresponding deallocation operations. The CopyRemoval pass can currently be +applied to operations that implement the `CopyOpInterface` in any of these two +situations which are + +* reusing the source buffer of the copy operation. +* reusing the target buffer of the copy operation. + +### Reusing the Source Buffer of the Copy Operation + +In this case, the source of the copy operation can be used instead of target. 
+The unused allocation and deallocation operations that are defined for this +copy operation are also removed. Here is a working example generated by the +BufferDeallocation pass that allocates a buffer with dynamic size. A deeper +analysis of this sample reveals that the highlighted operations are redundant +and can be removed. + +```mlir +func @dynamic_allocation(%arg0: index, %arg1: index) -> memref { + %7 = alloc(%arg0, %arg1) : memref + %c0_0 = constant 0 : index + %8 = dim %7, %c0_0 : memref + %c1_1 = constant 1 : index + %9 = dim %7, %c1_1 : memref + %10 = alloc(%8, %9) : memref + linalg.copy(%7, %10) : memref, memref + dealloc %7 : memref + return %10 : memref +} +``` + +Will be transformed to: + +```mlir +func @dynamic_allocation(%arg0: index, %arg1: index) -> memref { + %7 = alloc(%arg0, %arg1) : memref + %c0_0 = constant 0 : index + %8 = dim %7, %c0_0 : memref + %c1_1 = constant 1 : index + %9 = dim %7, %c1_1 : memref + return %7 : memref +} +``` + +In this case, the additional copy %10 can be replaced with its original source +buffer %7. This also applies to the associated dealloc operation of %7. + +To limit the complexity of this transformation, it only removes copy operations +when the following constraints are met: + +* The copy operation, the defining operation for the target value, and the +deallocation of the source value lie in the same block. +* There are no users/aliases of the target value between the defining operation +of the target value and its copy operation. +* There are no users/aliases of the source value between its associated copy +operation and the deallocation of the source value. + +### Reusing the Target Buffer of the Copy Operation + +In this case, the target buffer of the copy operation can be used instead of +its source. The unused allocation and deallocation operations that are defined +for this copy operation are also removed. + +Consider the following example where a generic linalg operation writes the +result to %temp and then copies %temp to %result. However, these two operations +can be merged into a single step. Copy removal removes the copy operation and +%temp, and replaces the uses of %temp with %result: + +```mlir +func @reuseTarget(%arg0: memref<2xf32>, %result: memref<2xf32>){ + %temp = alloc() : memref<2xf32> + linalg.generic { + args_in = 1 : i64, + args_out = 1 : i64, + indexing_maps = [#map0, #map0], + iterator_types = ["parallel"]} %arg0, %temp { + ^bb0(%gen2_arg0: f32, %gen2_arg1: f32): + %tmp2 = exp %gen2_arg0 : f32 + linalg.yield %tmp2 : f32 + }: memref<2xf32>, memref<2xf32> + "linalg.copy"(%temp, %result) : (memref<2xf32>, memref<2xf32>) -> () + dealloc %temp : memref<2xf32> + return +} +``` + +Will be transformed to: + +```mlir +func @reuseTarget(%arg0: memref<2xf32>, %result: memref<2xf32>){ + linalg.generic { + args_in = 1 : i64, + args_out = 1 : i64, + indexing_maps = [#map0, #map0], + iterator_types = ["parallel"]} %arg0, %result { + ^bb0(%gen2_arg0: f32, %gen2_arg1: f32): + %tmp2 = exp %gen2_arg0 : f32 + linalg.yield %tmp2 : f32 + }: memref<2xf32>, memref<2xf32> + return +} +``` + +Like before, several constraints to use the transformation apply: + +* The copy operation, the defining operation of the source value, and the +deallocation of the source value lie in the same block. +* There are no users/aliases of the target value between the defining operation +of the source value and the copy operation. +* There are no users/aliases of the source value between the copy operation and +the deallocation of the source value. 
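+
+Putting the phases together, the following sketch assembles the documented
+passes into a single pipeline. The `create*` factory functions are assumed to
+correspond to the documented pass flags (“-buffer-hoisting”,
+“-buffer-loop-hoisting”, “-promote-buffers-to-stack”, “-buffer-deallocation”
+and “-copy-removal”); `createMyDialectBufferizePass` stands for the
+user-defined preparation pass and is hypothetical:
+
+```c++
+// Sketch: assembling the phases described in this document into a pipeline.
+void buildBufferizationPipeline(OpPassManager &pm) {
+  // Phase 1: dialect-specific preparation (tensor to memref conversion),
+  // provided by the user (hypothetical factory name, therefore commented).
+  // pm.addPass(createMyDialectBufferizePass());
+
+  // Optional buffer optimizations; they must run before BufferDeallocation.
+  pm.addNestedPass<FuncOp>(createBufferHoistingPass());
+  pm.addNestedPass<FuncOp>(createBufferLoopHoistingPass());
+  pm.addNestedPass<FuncOp>(createPromoteBuffersToStackPass());
+
+  // Phase 2: placement of deallocations, temporary buffers and copies.
+  pm.addNestedPass<FuncOp>(createBufferDeallocationPass());
+
+  // Phase 3: post-processing that removes unnecessary copies and the
+  // corresponding allocation/deallocation operations.
+  pm.addNestedPass<FuncOp>(createCopyRemovalPass());
+}
+```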
+
+## Known Limitations
+
+BufferDeallocation introduces additional copies using allocations from the
+“std” dialect (“std.alloc”). Analogously, all deallocations use the free
+operation of the “std” dialect (“std.dealloc”). The actual copy process is
+realized using “linalg.copy”.
diff --git a/mlir/docs/includes/img/branch_example_post_move.svg b/mlir/docs/includes/img/branch_example_post_move.svg
new file mode 100644
--- /dev/null
+++ b/mlir/docs/includes/img/branch_example_post_move.svg
diff --git a/mlir/docs/includes/img/branch_example_pre_move.svg b/mlir/docs/includes/img/branch_example_pre_move.svg
new file mode 100644
--- /dev/null
+++ b/mlir/docs/includes/img/branch_example_pre_move.svg
diff --git a/mlir/docs/includes/img/nested_branch_example_post_move.svg b/mlir/docs/includes/img/nested_branch_example_post_move.svg
new file mode 100644
--- /dev/null
+++ b/mlir/docs/includes/img/nested_branch_example_post_move.svg
diff --git a/mlir/docs/includes/img/nested_branch_example_pre_move.svg b/mlir/docs/includes/img/nested_branch_example_pre_move.svg
new file mode 100644
--- /dev/null
+++ b/mlir/docs/includes/img/nested_branch_example_pre_move.svg
diff --git a/mlir/docs/includes/img/region_branch_example_pre_move.svg b/mlir/docs/includes/img/region_branch_example_pre_move.svg
new file mode 100644
--- /dev/null
+++ b/mlir/docs/includes/img/region_branch_example_pre_move.svg