Index: flang/docs/fstack-arrays.md =================================================================== --- /dev/null +++ flang/docs/fstack-arrays.md @@ -0,0 +1,152 @@ +# -fstack-arrays +## Problem Description +Other Fortran compilers have options to ensure that arrays are allocated on the +stack. These options are enabled along with `-Ofast`. + +## Proposed Solution +One approach would be to write a pass which attempts to move heap allocations to +the stack (like the memory-allocation-opt pass in reverse). This approach has +the advantage of a self-contained implementation and ensuring that newly added +cases are covered. + +Another approach would be to modify each place in which arrays are allocated on +the heap to instead generate a stack allocation. + +The second approach is preferable because it would be difficult to match up all +`fir.allocmem` operations with their associated `fir.freemem`. In general, the +lifetimes of heap allocated memory are not constrained by the current function's +stack frame and so cannot be always converted to stack allocations. It is much +easier to swap heap allocations for stack allocations when they are first +generated because the lifetime information is conveniently available. + +For example, to rewrite the heap allocation in the array-value-copy pass with +a stack allocation using the first approach would require analysis to ensure +that the heap allocation is always freed before the function returns. This is +much more complex than never generating a heap allocation (and free) in the +first place (the second approach). + +## Implementation details overview +Flang allocates most arrays on the stack by default, but there are a few cases +where temporary arrays are allocated on the heap: +- `flang/lib/Optimizer/Transforms/ArrayValueCopy.cpp` +- `flang/lib/Optimizer/Transforms/MemoryAllocation.cpp` +- `flang/lib/Lower/ConvertExpr.cpp` +- `flang/lib/Lower/IntrinsicCall.cpp` +- `flang/lib/Lower/ConvertVariable.cpp` + +Lowering code is being updated and in the future, temporaries for expressions +will be created in the HLFIR bufferization pass in +`flang/lib/Optimizer/HLFIR/Trnasforms/BufferizeHLFIR.cpp`. + +### `ArrayValueCopy.cpp` +Memory is allocated for a temporary array in `allocateArrayTemp()`. This should +be changed so that if an option is set requesting stack arrays, +`allocateArrayTemp()` will generate `fir.alloca` instead of `fir.allocmem` and +the clean up code will not include a `fir.freemem`. The `fir.alloca` needs to be +placed in the function entry block so that array value copy operations inside a +loop do not lead to repeated large stack allocations. + +For example, see this diff comparing the output of the array value copy pass +with and without stack-arrays enabled (SSA values adjusted manually to avoid +noise): + +``` +--- /tmp/stack-arrays-false.fir 2022-12-08 17:34:26.606124055 +0000 ++++ /tmp/stack-arrays-true.fir 2022-12-08 17:42:11.730699123 +0000 +@@ -1,5 +1,6 @@ + module attributes {dlti.dl_spec = #dlti.dl_spec<#dlti.dl_entry<"dlti.endianness", "little">, #dlti.dl_entry : vector<2xi32>>, #dlti.dl_entry : vector<2xi32>>, #dlti.dl_entry : vector<2xi32>>, #dlti.dl_entry : vector<2xi32>>, #dlti.dl_entry : vector<2xi32>>, #dlti.dl_entry : vector<2xi32>>, #dlti.dl_entry : vector<2xi32>>, #dlti.dl_entry : vector<2xi32>>, #dlti.dl_entry : vector<2xi32>>>, fir.defaultkind = "a1c4d8i4l4r4", fir.kindmap = "", llvm.data_layout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128", llvm.target_triple = "aarch64-unknown-linux-gnu"} { + func.func @_QPsb(%arg0: !fir.ref> {fir.bindc_name = "x"}) { ++ %10 = fir.alloca !fir.array<5xi32> + %c5 = arith.constant 5 : index + %c3_i64 = arith.constant 3 : i64 + %0 = fir.convert %c3_i64 : (i64) -> index +@@ -15,7 +16,6 @@ + %7 = arith.select %6, %5, %c0 : index + %8 = fir.shape %c5 : (index) -> !fir.shape<1> + %9 = fir.slice %0, %2, %1 : (index, index, index) -> !fir.slice<1> +- %10 = fir.allocmem !fir.array<5xi32> + %11 = fir.convert %c5 : (index) -> index + %c0_0 = arith.constant 0 : index + %c1 = arith.constant 1 : index +@@ -26,7 +26,7 @@ + %25 = fir.array_coor %arg0(%8) %24 : (!fir.ref>, !fir.shape<1>, index) -> !fir.ref + %c1_8 = arith.constant 1 : index + %26 = arith.addi %arg1, %c1_8 : index +- %27 = fir.array_coor %10(%8) %26 : (!fir.heap>, !fir.shape<1>, index) -> !fir.ref ++ %27 = fir.array_coor %10(%8) %26 : (!fir.ref>, !fir.shape<1>, index) -> !fir.ref + %28 = fir.load %25 : !fir.ref + fir.store %28 to %27 : !fir.ref + } +@@ -50,7 +50,7 @@ + %26 = fir.load %25 : !fir.ref + %c1_8 = arith.constant 1 : index + %27 = arith.addi %arg1, %c1_8 : index +- %28 = fir.array_coor %10(%8) [%9] %27 : (!fir.heap>, !fir.shape<1>, !fir.slice<1>, index) -> !fir.ref ++ %28 = fir.array_coor %10(%8) [%9] %27 : (!fir.ref>, !fir.shape<1>, !fir.slice<1>, index) -> !fir.ref + fir.store %26 to %28 : !fir.ref + fir.result %13 : !fir.array<5xi32> + } +@@ -61,14 +61,13 @@ + fir.do_loop %arg1 = %c0_5 to %23 step %c1_6 { + %c1_7 = arith.constant 1 : index + %24 = arith.addi %arg1, %c1_7 : index +- %25 = fir.array_coor %10(%8) %24 : (!fir.heap>, !fir.shape<1>, index) -> !fir.ref ++ %25 = fir.array_coor %99(%8) %24 : (!fir.ref>, !fir.shape<1>, index) -> !fir.ref + %c1_8 = arith.constant 1 : index + %26 = arith.addi %arg1, %c1_8 : index + %27 = fir.array_coor %arg0(%8) %26 : (!fir.ref>, !fir.shape<1>, index) -> !fir.ref + %28 = fir.load %25 : !fir.ref + fir.store %28 to %27 : !fir.ref + } +- fir.freemem %10 : !fir.heap> + return + } + } +``` + +### `MemoryAllocation.cpp` +The default options for the Memory Allocation transformation ensure that no +array allocations, no matter how large, are moved from the stack to the heap. + +### `ConvertExpr.cpp` +`ConvertExpr.cpp` allocates many array temporaries on the heap: + - Lowering mask expressions + - Passing array arguments by value or when they need re-shaping + - Lowering elemental array expressions + - Array initialization + +These will need addressing individually. The details are TODO. + +### `IntrinsicCall.cpp` +The existing design is for the runtime to do the allocation and the lowering +code to insert `fir.freemem` to remove the allocation. It is not clear whether +this can be replaced by adapting lowering code to do stack allocations and to +pass these to the runtime. This would be a significant change and so is out of +scope of `-fstack-arrays`. + +### `ConvertVariable.cpp` +See `Fortran::lower::MapSymbolAttributes`: + +Dummy arguments that are not declared in the current entry point require a +skeleton definition. Most such "unused" dummies will not survive into final +generated code, but some will. It is illegal to reference one at run time if it +does. There are three ways these dummies can be mapped to a value: +- a `fir::UndefOp` value: preferable but can't be used for dummies with dynamic + bounds or used to define another dummy. +- A stack slot. This has intermediate-weight cost but may not be usable for + objects with dynamic bounds. +- A heap box/descriptor is the heaviest weight option, only used for dynamic + objects. + +These heap allocations will be out of scope for `-fstack-arrays` because the +existing implementation already uses stack allocations (or better) where +possible, and most such dummy arguments will be removed from generated code. + +### `BufferizeHLFIR.cpp` +TODO + +## Testing Plan +Testing will duplicate the existing tests for all modified code, keeping one +version of the test checking the default behaviour and adding a new run checking +that the correct changes are observed when the stack arrays option is enabled. +This will likely take the form of alternative `FileCheck` prefixes for each mode.