Hoisting and sinking instructions out of conditional blocks enables
additional vectorization by:
- Executing memory accesses unconditionally.
- Reducing the number of instructions that need predication.
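To make this concrete, here is a minimal C sketch (the function and names
are made up for illustration, not taken from the patch or the benchmarks):
the load of a[i] is guarded by the condition, so vectorizing the loop as-is
requires a masked/predicated load.

    /* Hypothetical example: the load of a[i] only happens when c[i] > 0, so
     * the vectorizer has to predicate the access (e.g. with a masked load)
     * or give up on the loop. */
    void before_hoist(int *restrict dst, const int *restrict a,
                      const int *restrict c, int n) {
      for (int i = 0; i < n; i++) {
        int v = 0;
        if (c[i] > 0)
          v = a[i]; /* conditional memory access -> needs predication */
        dst[i] = v;
      }
    }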
After disabling early hoisting/sinking, we miss out on a few
vectorization opportunities. One of them causes a ~10% performance
regression in one of the Geekbench benchmarks on AArch64.
This patch tries to recover the regression by running hoisting/sinking
inside each inner loop before vectorization. This is not ideal, because
we also hoist/sink in loops that won't be vectorized. But LV already
does similar transformations for all inner loops (e.g. LoopSimplify and
LCSSA construction). Alternatively we could run a separate
loop-sink-hoist pass, but I am not sure that's worth the effort.
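To illustrate what running hoisting/sinking per inner loop buys the
vectorizer, here is the shape the sketch above takes once the conditional
load has been hoisted (again a hypothetical example; this is only legal when
the load is known safe to speculate, e.g. the location is dereferenceable):

    /* After hoisting, the load executes unconditionally and only a select
     * remains to be predicated, so the vectorizer can emit a plain wide load
     * plus a vector compare/select instead of a masked load. */
    void after_hoist(int *restrict dst, const int *restrict a,
                     const int *restrict c, int n) {
      for (int i = 0; i < n; i++) {
        int t = a[i];                /* hoisted out of the conditional block */
        dst[i] = (c[i] > 0) ? t : 0; /* predication reduced to a select */
      }
    }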
In the long term, the sinking/hoisting could and should be done in
VPlan, but that requires handling at least parts of legality and
cost modeling in VPlan as well.
Details about the impact on compile-time can be found here:
http://llvm-compile-time-tracker.com/compare.php?from=3a71d0de397e3a15c943ca59a00243ba8b7154da&to=c4efd69f4733b46e5de8fc2fa6e4c2495750d339&stat=instructions
NewPM-O3: geomean +0.18%
NewPM-ReleaseThinLTO: geomean +0.17%
NewPM-ReleaseLTO-g: geomean +0.18%
In terms of the number of loops vectorized, we have the following changes
across MultiSource/SPEC2000/SPEC2006 on X86 with LTO (baseline vs. patched
loop counts):

test-suite...000/186.crafty/186.crafty.test    20.00   22.00  10.0%
test-suite...006/450.soplex/450.soplex.test    85.00   86.00   1.2%
test-suite.../CINT2006/403.gcc/403.gcc.test   209.00  211.00   1.0%
test-suite...6/464.h264ref/464.h264ref.test   156.00  157.00   0.6%
test-suite...ications/JM/lencod/lencod.test   215.00  216.00   0.5%
And +0.5% more loops are vectorized in Geekbench on AArch64.
FWIW, these sink/hoist helpers are fine with lazy updates.