Consider the following loop:
void foo(float *dst, float *src, int N) {
  for (int i = 0; i < N; i++) {
    dst[i] = 0.0;
    for (int j = 0; j < N; j++) {
      dst[i] += src[(i * N) + j];
    }
  }
}
When we are not building with -Ofast we may still attempt to vectorise
the inner loop using ordered reductions instead. In addition we also
try to select an appropriate interleave count for the inner loop.
However, when VF=1 is chosen the inner loop remains scalar, and there
is existing code in selectInterleaveCount that limits the interleave
count to 2 for reductions due to concerns about increasing the
critical path.
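
To illustrate the critical-path concern, the snippet below shows a
hypothetical interleaved-by-two version of the scalar inner loop
(UF=2, VF=1); the helper name row_sum is made up for this sketch.
Because the reduction is ordered we may not split the accumulator into
partial sums, so every fadd still depends on the one before it:

float row_sum(const float *src, int i, int N) {
  float sum = 0.0f;
  int j = 0;
  for (; j + 1 < N; j += 2) {
    sum = sum + src[(i * N) + j];     // depends on the previous sum
    sum = sum + src[(i * N) + j + 1]; // depends on the add above
  }
  for (; j < N; j++)                  // scalar remainder iteration
    sum = sum + src[(i * N) + j];
  return sum;
}

An unordered reduction could instead keep two independent partial sums
and combine them after the loop, halving the dependence chain, but
that reassociation changes the rounding and is exactly what strict FP
semantics forbid; interleaving therefore exposes no ILP here.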
For ordered reductions this problem is even worse, because the strict
FP semantics impose an additional loop-carried data dependency, and so
I've added code to simply disable interleaving for scalar ordered
reductions for now.
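
The bail-out in selectInterleaveCount might look roughly like the
sketch below. This is an illustration rather than the verbatim diff:
the exact placement and control flow are assumptions, although
VF.isScalar(), Legal->getReductionVars() and useOrderedReductions()
are existing cost-model interfaces.

// Sketch (not the verbatim diff): refuse to interleave when the
// chosen VF is scalar and the loop contains an ordered reduction.
if (VF.isScalar()) {
  for (auto &Reduction : Legal->getReductionVars()) {
    const RecurrenceDescriptor &RdxDesc = Reduction.second;
    // An ordered (strict) reduction forms one serial chain of FP
    // operations, so interleaving the scalar loop cannot expose any
    // ILP; force an interleave count of 1.
    if (useOrderedReductions(RdxDesc))
      return 1;
  }
}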
Test added here:
Transforms/LoopVectorize/AArch64/strict-fadd-vf1.ll
I don't think I fully understand why disabling interleaving is more profitable than having it enabled when VF=1, but I think you empirically found that using UF>1 when VF=1 leads to regressions when strict reductions are enabled.
This means that with this patch, enabling strict reductions by default will no longer lead to regressions, whereas without strict reductions enabled this loop would not have been vectorized or interleaved in the first place. So this purely limits the scope of strict reductions to avoid regressions.
That approach sounds sensible to me.