Promotion of conditional accesses can break the safety property, which consists of two parts (both are illustrated by the sketch after this list):
- The memory may not be dereferenceable on loop entry. In that case, we cannot hoist load instructions into the preheader basic block.
- The memory model does not allow us to insert a store along any dynamic path which did not originally have one.
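For example, both parts show up in a loop like the one below (a minimal sketch; the helper g and its parameters are made up for illustration): *p may never be dereferenced at all if every a[i] is zero, and it is written only on those iterations where a[i] is nonzero.

```
void g(int *p, const int *a, int n) {
  for (int i = 0; i < n; ++i)
    if (a[i])
      *p += a[i]; /* Hoisting this load or sinking this store would execute
                     a memory access the original code might never perform. */
}
```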
As of D113289, the LICM pass hoists load instructions that are not guaranteed to execute into the preheader basic block, provided the memory is proven dereferenceable on loop entry. Note that that change does not sink store instructions that are not guaranteed to execute into the exit blocks.
Sinking store instructions that are not guaranteed to execute directly breaks the second part of the safety property above: sinking conditional store instructions may introduce data races or traps. To make the transformation thread-safe, we need to make sure the sunk store instructions are executed conditionally, depending on whether or not the value was actually written inside the loop.
Consider the following source code:
```
int u;

void f(int a[restrict], int n) {
  for (int i = 0; i < n; ++i)
    if (a[i])
      ++u;
}
```
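Sinking the store to u unconditionally would break the second part of the safety property for this very function. As a sketch (not part of the patch; the threading harness below is made up for illustration), consider a program where another thread writes u while f runs over an all-zero array, so the original f performs no access to u at all:

```
#include <pthread.h>

int u;

/* Same function as above, repeated so the sketch is self-contained. */
void f(int a[restrict], int n) {
  for (int i = 0; i < n; ++i)
    if (a[i])
      ++u;
}

static int zeros[100]; /* all zero, so f() never stores to u */

static void *run_f(void *arg)  { (void)arg; f(zeros, 100); return 0; }
static void *bump_u(void *arg) { (void)arg; ++u;           return 0; }

int main(void) {
  pthread_t t1, t2;
  pthread_create(&t1, 0, run_f, 0);
  pthread_create(&t2, 0, bump_u, 0);
  pthread_join(t1, 0);
  pthread_join(t2, 0);
  /* Race-free as written: only bump_u() stores to u. If LICM sank the store
     in f() unconditionally, f() would also write u, introducing a data race
     along a dynamic path that originally had no store. */
  return u;
}
```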
To promote the conditional access of u (which includes hoisting the load instruction and sinking the store instruction), this patch allows the LICM pass to insert a simple flag, which is initially (in the preheader) down. Whenever control flow reaches the conditional store to u, the flag is raised. Finally, the sunk store instruction is executed conditionally (using the llvm.masked.store intrinsic) in the exit blocks, depending on whether or not the flag was raised.
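Conceptually (a source-level sketch of the transformed loop, not actual compiler output; the real transformation emits the guarded store through llvm.masked.store rather than a branch), the promoted version of f behaves like:

```
int u;

void f(int a[restrict], int n) {
  int u_promoted = u;  /* hoisted load: u is a global, hence dereferenceable */
  _Bool u_flag = 0;    /* the flag, initially down in the preheader */
  for (int i = 0; i < n; ++i)
    if (a[i]) {
      ++u_promoted;
      u_flag = 1;      /* raised on the path that originally stored to u */
    }
  if (u_flag)          /* sunk store, executed only if the flag was raised */
    u = u_promoted;
}
```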
For the source code given above, the following pseudo LLVM IR (only the relevant parts are shown) corresponds to the actual LLVM IR that LICM now generates:
```
...
entry:
  u.promoted = load u
  br for.cond

for.cond:
  u.flag = phi [ 0, entry ], [ u.flag.next, for.inc ]    ; Note that the flag is down if we get here through the preheader basic block.
  inc = phi [ u.promoted, entry ], [ inc.next, for.inc ]
  ...

for.body:
  ...

if.then:
  inc.if.then = add inc, 1
  br for.inc

for.inc:
  u.flag.next = phi [ 1, if.then ], [ u.flag, for.body ] ; Note that the flag is raised if we get here through the if.then basic block.
  inc.next = phi [ inc.if.then, if.then ], [ inc, for.body ]
  ...

for.cond.cleanup:                                        ; This is the only exit block.
  u.flag.lcssa = phi [ u.flag, for.cond ]                ; Get the flag value.
  inc.lcssa = phi [ inc, for.cond ]
  call void @llvm.masked.store(<1 x i32> [inc.lcssa], <1 x i32>* [u], i32 <alignment>, <1 x i1> [u.flag.lcssa])
  ret void
}
...
```
This patch addresses potential optimizations proposed in: Missed opportunities for register promotion.
IIUC, this should be a target-independent optimization? And GCC generates such code for both x86_64 and aarch64, right?
The generated code will be bigger, which will impact code size, so I guess this needs motivation in terms of (SPEC) benchmarking.