Index: flang/docs/fstack-arrays.md
===================================================================
--- /dev/null
+++ flang/docs/fstack-arrays.md
@@ -0,0 +1,152 @@
+# -fstack-arrays
+## Problem Description
+Other Fortran compilers have options to ensure that arrays are allocated on the
+stack. These options are enabled along with `-Ofast`.
+
+## Proposed Solution
+One approach would be to write a pass which attempts to move heap allocations to
+the stack (like the memory-allocation-opt pass in reverse). This approach has
+the advantage of a self-contained implementation and ensuring that newly added
+cases are covered.
+
+Another approach would be to modify each place in which arrays are allocated on
+the heap to instead generate a stack allocation.
+
+The second approach is preferable because it would be difficult to match up all
+`fir.allocmem` operations with their associated `fir.freemem`. In general, the
+lifetimes of heap allocated memory are not constrained by the current function's
+stack frame and so cannot be always converted to stack allocations. It is much
+easier to swap heap allocations for stack allocations when they are first
+generated because the lifetime information is conveniently available.
+
+For example, to rewrite the heap allocation in the array-value-copy pass with
+a stack allocation using the first approach would require analysis to ensure
+that the heap allocation is always freed before the function returns. This is
+much more complex than never generating a heap allocation (and free) in the
+first place (the second approach).
+
+## Implementation details overview
+Flang allocates most arrays on the stack by default, but there are a few cases
+where temporary arrays are allocated on the heap:
+- `flang/lib/Optimizer/Transforms/ArrayValueCopy.cpp`
+- `flang/lib/Optimizer/Transforms/MemoryAllocation.cpp`
+- `flang/lib/Lower/ConvertExpr.cpp`
+- `flang/lib/Lower/IntrinsicCall.cpp`
+- `flang/lib/Lower/ConvertVariable.cpp`
+
+Lowering code is being updated and in the future, temporaries for expressions
+will be created in the HLFIR bufferization pass in
+`flang/lib/Optimizer/HLFIR/Trnasforms/BufferizeHLFIR.cpp`.
+
+### `ArrayValueCopy.cpp`
+Memory is allocated for a temporary array in `allocateArrayTemp()`. This should
+be changed so that if an option is set requesting stack arrays,
+`allocateArrayTemp()` will generate `fir.alloca` instead of `fir.allocmem` and
+the clean up code will not include a `fir.freemem`. The `fir.alloca` needs to be
+placed in the function entry block so that array value copy operations inside a
+loop do not lead to repeated large stack allocations.
+
+For example, see this diff comparing the output of the array value copy pass
+with and without stack-arrays enabled (SSA values adjusted manually to avoid
+noise):
+
+```
+--- /tmp/stack-arrays-false.fir	2022-12-08 17:34:26.606124055 +0000
++++ /tmp/stack-arrays-true.fir	2022-12-08 17:42:11.730699123 +0000
+@@ -1,5 +1,6 @@
+ module attributes {dlti.dl_spec = #dlti.dl_spec<#dlti.dl_entry<"dlti.endianness", "little">, #dlti.dl_entry<i8, dense<[8, 32]> : vector<2xi32>>, #dlti.dl_entry<i16, dense<[16, 32]> : vector<2xi32>>, #dlti.dl_entry<i64, dense<64> : vector<2xi32>>, #dlti.dl_entry<i128, dense<128> : vector<2xi32>>, #dlti.dl_entry<i1, dense<8> : vector<2xi32>>, #dlti.dl_entry<i32, dense<32> : vector<2xi32>>, #dlti.dl_entry<f16, dense<16> : vector<2xi32>>, #dlti.dl_entry<f64, dense<64> : vector<2xi32>>, #dlti.dl_entry<f128, dense<128> : vector<2xi32>>>, fir.defaultkind = "a1c4d8i4l4r4", fir.kindmap = "", llvm.data_layout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128", llvm.target_triple = "aarch64-unknown-linux-gnu"} {
+   func.func @_QPsb(%arg0: !fir.ref<!fir.array<5xi32>> {fir.bindc_name = "x"}) {
++    %10 = fir.alloca !fir.array<5xi32>
+     %c5 = arith.constant 5 : index
+     %c3_i64 = arith.constant 3 : i64
+     %0 = fir.convert %c3_i64 : (i64) -> index
+@@ -15,7 +16,6 @@
+     %7 = arith.select %6, %5, %c0 : index
+     %8 = fir.shape %c5 : (index) -> !fir.shape<1>
+     %9 = fir.slice %0, %2, %1 : (index, index, index) -> !fir.slice<1>
+-    %10 = fir.allocmem !fir.array<5xi32>
+     %11 = fir.convert %c5 : (index) -> index
+     %c0_0 = arith.constant 0 : index
+     %c1 = arith.constant 1 : index
+@@ -26,7 +26,7 @@
+       %25 = fir.array_coor %arg0(%8) %24 : (!fir.ref<!fir.array<5xi32>>, !fir.shape<1>, index) -> !fir.ref<i32>
+       %c1_8 = arith.constant 1 : index
+       %26 = arith.addi %arg1, %c1_8 : index
+-      %27 = fir.array_coor %10(%8) %26 : (!fir.heap<!fir.array<5xi32>>, !fir.shape<1>, index) -> !fir.ref<i32>
++      %27 = fir.array_coor %10(%8) %26 : (!fir.ref<!fir.array<5xi32>>, !fir.shape<1>, index) -> !fir.ref<i32>
+       %28 = fir.load %25 : !fir.ref<i32>
+       fir.store %28 to %27 : !fir.ref<i32>
+     }
+@@ -50,7 +50,7 @@
+       %26 = fir.load %25 : !fir.ref<i32>
+       %c1_8 = arith.constant 1 : index
+       %27 = arith.addi %arg1, %c1_8 : index
+-      %28 = fir.array_coor %10(%8) [%9] %27 : (!fir.heap<!fir.array<5xi32>>, !fir.shape<1>, !fir.slice<1>, index) -> !fir.ref<i32>
++      %28 = fir.array_coor %10(%8) [%9] %27 : (!fir.ref<!fir.array<5xi32>>, !fir.shape<1>, !fir.slice<1>, index) -> !fir.ref<i32>
+       fir.store %26 to %28 : !fir.ref<i32>
+       fir.result %13 : !fir.array<5xi32>
+     }
+@@ -61,14 +61,13 @@
+     fir.do_loop %arg1 = %c0_5 to %23 step %c1_6 {
+       %c1_7 = arith.constant 1 : index
+       %24 = arith.addi %arg1, %c1_7 : index
+-      %25 = fir.array_coor %10(%8) %24 : (!fir.heap<!fir.array<5xi32>>, !fir.shape<1>, index) -> !fir.ref<i32>
++      %25 = fir.array_coor %99(%8) %24 : (!fir.ref<!fir.array<5xi32>>, !fir.shape<1>, index) -> !fir.ref<i32>
+       %c1_8 = arith.constant 1 : index
+       %26 = arith.addi %arg1, %c1_8 : index
+       %27 = fir.array_coor %arg0(%8) %26 : (!fir.ref<!fir.array<5xi32>>, !fir.shape<1>, index) -> !fir.ref<i32>
+       %28 = fir.load %25 : !fir.ref<i32>
+       fir.store %28 to %27 : !fir.ref<i32>
+     }
+-    fir.freemem %10 : !fir.heap<!fir.array<5xi32>>
+     return
+   }
+ }
+```
+
+### `MemoryAllocation.cpp`
+The default options for the Memory Allocation transformation ensure that no
+array allocations, no matter how large, are moved from the stack to the heap.
+
+### `ConvertExpr.cpp`
+`ConvertExpr.cpp` allocates many array temporaries on the heap:
+  - Lowering mask expressions
+  - Passing array arguments by value or when they need re-shaping
+  - Lowering elemental array expressions
+  - Array initialization
+
+These will need addressing individually. The details are TODO.
+
+### `IntrinsicCall.cpp`
+The existing design is for the runtime to do the allocation and the lowering
+code to insert `fir.freemem` to remove the allocation. It is not clear whether
+this can be replaced by adapting lowering code to do stack allocations and to
+pass these to the runtime. This would be a significant change and so is out of
+scope of `-fstack-arrays`.
+
+### `ConvertVariable.cpp`
+See `Fortran::lower::MapSymbolAttributes`:
+
+Dummy arguments that are not declared in the current entry point require a
+skeleton definition. Most such "unused" dummies will not survive into final
+generated code, but some will. It is illegal to reference one at run time if it
+does. There are three ways these dummies can be mapped to a value:
+- a `fir::UndefOp` value: preferable but can't be used for dummies with dynamic
+  bounds or used to define another dummy.
+- A stack slot. This has intermediate-weight cost but may not be usable for
+  objects with dynamic bounds.
+- A heap box/descriptor is the heaviest weight option, only used for dynamic
+  objects.
+
+These heap allocations will be out of scope for `-fstack-arrays` because the
+existing implementation already uses stack allocations (or better) where
+possible, and most such dummy arguments will be removed from generated code.
+
+### `BufferizeHLFIR.cpp`
+TODO
+
+## Testing Plan
+Testing will duplicate the existing tests for all modified code, keeping one
+version of the test checking the default behaviour and adding a new run checking
+that the correct changes are observed when the stack arrays option is enabled.
+This will likely take the form of alternative `FileCheck` prefixes for each mode.