This patch merges STR&lt;S,D,Q,W,X&gt;pre + STR&lt;S,D,Q,W,X&gt;ui and LDR&lt;S,D,Q,W,X&gt;pre + LDR&lt;S,D,Q,W,X&gt;ui instruction pairs into a single STP&lt;S,D,Q,W,X&gt;pre or LDP&lt;S,D,Q,W,X&gt;pre instruction, respectively.
For each pair, there is a MIR test that verifies this optimization.
This was a missed opportunity in the AArch64 load/store optimizer for cases such as:
  #define float32_t float
  #define uint32_t unsigned

  void test(float32_t *S, float32_t *D, uint32_t N) {
    for (uint32_t i = 0; i < N; i++) {
      D[i] = D[i] + S[i];
    }
  }
When compiled with:
  -Ofast -target aarch64-arm-none-eabi -mcpu=cortex-a55 -mllvm -lsr-preferred-addressing-mode=preindexed
it results in:
  .LBB0_9:                          // =>This Inner Loop Header: Depth=1
      ldr   q0, [x11, #32]!
      ldr   q1, [x11, #16]
      subs  x12, x12, #8            // =8
      ldr   q2, [x10, #32]!
      ldr   q3, [x10, #16]
      fadd  v0.4s, v2.4s, v0.4s
      fadd  v1.4s, v3.4s, v1.4s
      stp   q0, q1, [x11]
      b.ne  .LBB0_9
where:
  ldr   q0, [x11, #32]!
  ldr   q1, [x11, #16]
should be:
  ldp   q0, q1, [x11, #32]!
Additionally for cases like:
  define <4 x i32>* @strqpre-strqui-merge(<4 x i32>* %p, <4 x i32> %a, <4 x i32> %b) {
  entry:
    %p0 = getelementptr <4 x i32>, <4 x i32>* %p, i32 2
    store <4 x i32> %a, <4 x i32>* %p0
    %p1 = getelementptr <4 x i32>, <4 x i32>* %p, i32 3
    store <4 x i32> %b, <4 x i32>* %p1
    ret <4 x i32>* %p0
  }
It results in:
"strqpre-strqui-merge": // @strqpre-strqui-merge str q0, [x0, #32]! str q1, [x0, #16] ret
where the two store instructions should be merged into:
  stp   q0, q1, [x0, #32]!
This patch covers such cases for the various forms of STR&lt;&gt;pre/LDR&lt;&gt;pre.
I feel like the "Unscaled" instructions are a set of instructions in their own right. Can we rename the function to something like hasUnscaledLdStOffset to make its meaning clear?