The loop vectoriser is sandwiched between two loop-unroll invocations in the optimisation pipeline, and this patch removes the first one. The motivation is that (fully) unrolling loops early removes opportunities for the loop vectoriser. This is often the case for loops with constant bounds and relatively small iteration counts, for which vectorisation is still very much profitable. After such loops are fully unrolled, the SLP vectoriser is not always able to compensate, or is not (yet) as effective as the loop vectoriser. Therefore, first performing loop vectorisation, then unrolling, then SLP vectorisation seems the better approach.
There are quite a few of these cases in x264 in SPEC, like this one, which GCC loop-vectorises and we don't, which is a big part of why we are behind:
for( int i = 0; i < 16; i++ )
    if( dct[i] > 0 )
        dct[i] = (bias[i] + dct[i]) * mf[i] >> 16;
    else
        dct[i] = -((bias[i] - dct[i]) * mf[i] >> 16);
But this is also a bit of an old problem, and at least the following PRs are related: PR47178, PR47726, PR47554, PR47436, PR31572, PR47553, PR47491.
Some first performance numbers with the patch, where + is a performance improvement and - is a regression:
AArch64 (neoverse-n1):
500.perlbench_r   +0.34%
502.gcc_r         -0.28%
505.mcf_r         -0.60%
520.omnetpp_r     +0.58%
523.xalancbmk_r   +1.68%
525.x264_r        +1.33%
531.deepsjeng_r   +0.29%
541.leela_r       -0.54%
557.xz_r           0.00%
And for some embedded benchmarks:
Thumb2 (Cortex-M55):
benchmark1  -0.21%
benchmark2  +0.06%
DSP         +0.02%
These numbers show an improvement where I would like to see it: x264. The uplift in xalancbmk is nice too, but I haven't analysed that one yet. The other numbers show a little up-and-down behaviour, but the changes are very small and overall cancel each other out. I think these are really encouraging results, because they suggest we get the improvement where we want it without any fallout. This picture was confirmed on a set of embedded benchmarks.
I am not really a fan of the LLVM test suite as a performance benchmark (it is noisy), but I will get some numbers for that too. While I do that, and fix up a few LLVM regression tests (the ones that check the optimisation pipeline order), I wanted to share this already to get some opinions on it.