For tight loops like this:
float r = 0; for (int i = 0; i < n; i++) { r += a[i]; }
it's better not to vectorise at -O3 using fixed-width ordered reductions
on AArch64 targets. Although the number of instructions in the generated
code is comparable to not vectorising at all, there may be additional
costs on some CPUs; for example, the scheduling may be worse. It
therefore makes sense to deter vectorisation of such tight loops.
I don't know if we need to talk about this in terms of scheduling exactly, since that will be very dependent on the CPU used. Perhaps just describe it in terms of "extra overheads on some CPUs".
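
For reference, here is a rough C model of why the instruction counts in the summary come out comparable. It is only a sketch under stated assumptions, not the code the vectoriser actually emits. The names sum_scalar and sum_ordered_vf4 and the vector width of 4 are hypothetical, chosen for illustration. The point it tries to show is that an ordered (strict, in-order) reduction must still perform one dependent floating-point add per element, so processing four lanes at a time saves loads but does not shorten the chain of adds.

```c
/* Scalar loop from the summary: one load and one dependent add per element. */
float sum_scalar(const float *a, int n) {
    float r = 0;
    for (int i = 0; i < n; i++)
        r += a[i];
    return r;
}

/* Hypothetical model of a fixed-width ordered reduction with 4 lanes.
 * The local array stands in for a single vector load; the four adds
 * stay sequential and in the original order, because an ordered
 * reduction may not reassociate the floating-point additions. */
float sum_ordered_vf4(const float *a, int n) {
    float r = 0;
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        float v[4] = { a[i], a[i + 1], a[i + 2], a[i + 3] };
        r += v[0];
        r += v[1];
        r += v[2];
        r += v[3];
    }
    /* Scalar remainder for the last n % 4 elements. */
    for (; i < n; i++)
        r += a[i];
    return r;
}
```

Both functions add the elements in the same order, so they should return the same value; the model differs only in how the elements are fetched. Any lane extraction or other per-iteration work in real generated code would come on top of this, which is consistent with the "extra overheads on some CPUs" wording suggested above.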