Bit of a brain dump, because I was seeing the same problems with addressing modes in unrolled loops. This is closely related to what @SjoerdMeijer is currently working on in D89693, and I doubt I will have time to look into it further...
For the benchmark that I am looking at, the total size shrinks, but there seems to be a problem because we no longer generate the LDPs (which I presume is just a current limitation of the AArch64LoadStoreOptimizer?):
< ldp q0, q2, [x2, #-16]
< ldp q1, q3, [x4, #-16]
< subs x5, x5, #8 // =8
< add x4, x4, #32 // =32
< add x2, x2, #32 // =32
< fmul v0.4s, v0.4s, v1.4s
< fmul v2.4s, v2.4s, v3.4s
< ldp q1, q3, [x3, #-16]
< fadd v0.4s, v1.4s, v0.4s
< fadd v1.4s, v3.4s, v2.4s
< stp q0, q1, [x3, #-16]
< add x3, x3, #32 // =32
---
> ldr q0, [x5, #32]!
> subs x27, x27, #8 // =8
> ldur q1, [x5, #-16]
> ldr q2, [x7, #32]!
> ldur q3, [x7, #-16]
> ldr q4, [x6, #32]!
> fmul v0.4s, v0.4s, v2.4s
> fmul v1.4s, v1.4s, v3.4s
> ldr q2, [x6, #16]
> fadd v1.4s, v4.4s, v1.4s
> fadd v0.4s, v2.4s, v0.4s
> stp q1, q0, [x6]
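For context, a loop of roughly the following shape would plausibly vectorize and unroll into code like the above (two loads, a multiply, a load of the destination, an add, and a store). This is only my guess reconstructed from the assembly, not the actual benchmark source; the function name `multiply_accumulate` and its signature are hypothetical.

```c
#include <stddef.h>

/* Hypothetical kernel: c[i] += a[i] * b[i].
 * Assumption: three non-aliasing float arrays, which matches the three
 * pointer registers and the fmul/fadd pairs in the assembly above.
 * With 4 floats per q register and two q registers per iteration, the
 * vectorized loop would step 8 elements at a time, consistent with the
 * "subs ..., #8" in both versions. */
void multiply_accumulate(float *restrict c, const float *restrict a,
                         const float *restrict b, size_t n) {
    for (size_t i = 0; i < n; ++i)
        c[i] += a[i] * b[i];
}
```

Something like this, built at -O3 for AArch64, might be a convenient starting point for a reduced test case, assuming it reproduces the missing LDPs at all.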