diff --git a/flang/docs/fstack-arrays.md b/flang/docs/fstack-arrays.md new file mode 100644 --- /dev/null +++ b/flang/docs/fstack-arrays.md @@ -0,0 +1,192 @@ +# -fstack-arrays +## Problem Description +In gfortran, `-fstack-arrays` will cause all local arrays, including those of +unknown size, to be allocated from stack memory. Gfortran enables this flag by +default at `-Ofast`. + +Flang already allocates all local arrays on the stack (the +`memory-allocation-opt` pass can move large allocations to the heap, but by +default it will not). But there are some cases where temporary arrays are +created on the heap by Flang. Moving these temporary arrays is the goal here. + +## Proposed Solution +One approach would be to write a pass which attempts to move heap allocations to +the stack (like the `memory-allocation-opt` pass in reverse). This approach has +the advantage of a self-contained implementation and ensuring that newly added +cases are covered. + +Another approach would be to modify each place in which arrays are allocated on +the heap to instead generate a stack allocation. The advantage of the second +approach is that it would be difficult to match up all `fir.allocmem` operations +with their associated `fir.freemem`. In general, the lifetimes of heap allocated +memory are not constrained by the current function's stack frame and so cannot +be always converted to stack allocations. It is much easier to swap heap +allocations for stack allocations when they are first generated because the +lifetime information is conveniently available. + +For example, to rewrite the heap allocation in the `array-value-copy` pass with +a stack allocation using the first approach would require analysis to ensure +that the heap allocation is always freed before the function returns. This is +much more complex than never generating a heap allocation (and free) in the +first place (the second approach). + +The plan is to take the more complex first approach so that newly added changes +to lowering code do not need to be made to support the stack arrays option. The +general problem of determining heap allocation lifetimes can be simplified in +this case because only those allocations which are always freed before exit from +the same function are possible to be moved to heap allocations in that +function's stack frame. Furthermore, the aim of this pass would be to only move +those allocations generated by Flang, and so some of the more difficult cases +can be avoided. Where the transformation is not possible, the existing heap +allocation will be kept. + +## Implementation details overview +In order to limit the complexity of the proposed pass, it is useful to +understand the situations in which Flang will generate heap allocations. + +### Known Heap Array Allocations +Flang allocates most arrays on the stack by default, but there are a few cases +where temporary arrays are allocated on the heap: +- `flang/lib/Optimizer/Transforms/ArrayValueCopy.cpp` +- `flang/lib/Optimizer/Transforms/MemoryAllocation.cpp` +- `flang/lib/Lower/ConvertExpr.cpp` +- `flang/lib/Lower/IntrinsicCall.cpp` +- `flang/lib/Lower/ConvertVariable.cpp` + +Lowering code is being updated and in the future, temporaries for expressions +will be created in the HLFIR bufferization pass in +`flang/lib/Optimizer/HLFIR/Trnasforms/BufferizeHLFIR.cpp`. + +#### `ArrayValueCopy.cpp` +Memory is allocated for a temporary array in `allocateArrayTemp()`. This +temporary array is used to ensure that assignments of one array to itself +produce the required value. E.g. + +``` +integer, dimension(5), intent(inout) :: x +x(3,4) = x(1,2) +``` + +#### `MemoryAllocation.cpp` +The default options for the Memory Allocation transformation ensure that no +array allocations, no matter how large, are moved from the stack to the heap. + +#### `ConvertExpr.cpp` +`ConvertExpr.cpp` allocates many array temporaries on the heap: + - Passing array arguments by value or when they need re-shaping + - Lowering elemental array expressions + - Lowering mask expressions + - Array constructors + +The last two of these cases are **not** covered by the current stack arrays pass +design. + +The FIR code generated for mask expressions (the WHERE construct) sets a +boolean variable to indicate whether a heap allocation was necessary. The +allocation is only freed if the variable indicates that the allocation was +performed to begin with. The proposed dataflow analysis is not intelligent +enough to statically determine that the boolean variable will always be true +when the allocation is performed. Beyond this, the control flow in the generated +FIR code passes the allocated memory through `fir.result`, resulting in a +different SSA value to be allocated and freed, causing the analysis not to +realise that the allocated memory is freed. The most convenient solution here +would be to generate less complicated FIR code, as the existing codegen has +known bugs: https://github.com/llvm/llvm-project/issues/56921, +https://github.com/llvm/llvm-project/issues/59803. + +Code generated for array constructors uses `realloc()` to grow the allocated +buffer because the size of the resulting array cannot always be determined +ahead of running the constructor. This makes this temporary unsuitable +for allocation on the stack. + +#### `IntrinsicCall.cpp` +The existing design is for the runtime to do the allocation and the lowering +code to insert `fir.freemem` to remove the allocation. It is not clear whether +this can be replaced by adapting lowering code to do stack allocations and to +pass these to the runtime. This would be a significant change and so is out of +scope of `-fstack-arrays`. + +#### `ConvertVariable.cpp` +See `Fortran::lower::MapSymbolAttributes`: + +Dummy arguments that are not declared in the current entry point require a +skeleton definition. Most such "unused" dummies will not survive into final +generated code, but some will. It is illegal to reference one at run time if it +does. There are three ways these dummies can be mapped to a value: +- a `fir::UndefOp` value: preferable but can't be used for dummies with dynamic + bounds or used to define another dummy. +- A stack slot. This has intermediate-weight cost but may not be usable for + objects with dynamic bounds. +- A heap box/descriptor is the heaviest weight option, only used for dynamic + objects. + +These heap allocations will be out of scope for `-fstack-arrays` because the +existing implementation already uses stack allocations (or better) where +possible, and most such dummy arguments will be removed from generated code. + +#### `BufferizeHLFIR.cpp` +TODO + +### Detecting Allocations to Move +Allocations which could be moved to the stack will be detected by performing a +forward dense data flow analysis using `mlir::dataflow::DenseDataFlowAnalysis`. +This analysis will search for SSA values created by a `fir.allocmem` which are +always freed using `fir.freemem` within the same function. + +Tracking allocations by SSA values will miss some cases where address +calculations are duplicated in different blocks: resulting in different SSA +values as arguments for the allocation and the free. It is believed that simple +tracking of SSA values is sufficient for detecting the allocations for array +temporaries added by Flang, because these temporaries should be simple arrays: +not nested inside of derived types or other arrays. + +Arrays allocated by an `allocate` statement in source code should not be moved +to the stack. These will be identified by adding an attribute to these +`fir.allocmem` operations when they are lowered. Intrinsics such as `allocated` +and `move_alloc` do not need to be supported because the array temporaries moved +to the stack will never be visible in user code. + +Only allocations for arrays will be considered for moving to the stack. + +When conducting the dense data flow analysis, the pass will assume that existing +allocations are not freed inside called functions. This is true for the existing +heap array temporary allocations generated by Flang. If an allocation were to be +freed inside of a called function, the allocation would presumably not also be +freed later in the caller function (as this would be a double-free). Therefore +the stack arrays pass would not find where the allocation was freed and so not +transform the allocation. It is necessary to make this assumption so that the +stack arrays pass can transform array allocations created for pass-by-value +function arguments. + +### Moving Allocations +The `fir.allocmem` will be replaced by a `fir.alloca` with the same arguments. +The `fir.freemem` will be removed entirely. + +Where possible, the `fir.alloca` should be placed in the function entry block +(so we can be sure it only happens once). This may not be possible if the array +has non-constant extents (causing the `fir.alloca` to have SSA values as +operands). In this case, the `fir.alloca` will be placed immediately after the +last operand becomes available. + +If this location is inside a loop (either an `mlir::LoopLikeOpInterface` or a +cyclic CFG), the transformation should attempt to use the `llvm.stacksave`/ +`llvm.stackrestore` intrinsics to ensure that the stack does not grow on every +loop iteration. Use of these intrinsics is considered valid when the allocation +and deallocation occur within the same block and there are no other stack +allocations between them. If this is not the case, the existing heap allocation +will be preserved. + +### FIR attribute +A FIR attribute will be added to distinguish `fir.allocmem` arising from +`allocate` statements in source from `fir.allocmem` operations added by Flang. +The attribute will be called `"fir.must_be_heap"` and will have a boolean value: +`true` meaning that the stack arrays pass must not move the allocation, `false` +meaning that stack arrays may move the allocation. Not specifying the attribute +will be equivalent to setting it to `false`. + +## Testing Plan +FileCheck tests will be written to check each of the above identified sources of +heap allocated array temporaries are detected and converted by the new pass. + +Another test will check that `allocate` statements in source code will not be +moved to the stack.