Currently LV can choose sub-optimal vectorization factors for loops with
memory accesses using different widths. At the moment, the largest type
limits the vectorization factor, but this is overly pessimistic on some
targets, which have memory instructions that require a certain minimum
VF for operations on narrow types.
The motivating example is AArch64, which requires larger VFs for
vectorization to be profitable when narrow types are involved.
Currently, code like the following is not vectorized on AArch64, because
the chosen max VF of 4 (dictated by the largest type, i32) is not
profitable due to type extensions.
    int foo(unsigned char *len, unsigned size) {
      int maxLen = 0;
      int minLen = 0;
      for (unsigned i = 0; i < size; i++) {
        if (len[i] > maxLen)
          maxLen = len[i];
        if (len[i] < minLen)
          minLen = len[i];
      }
      return maxLen + minLen;
    }
This patch addresses this issue by detecting cases where memory ops for
the narrowest type are more expensive at the type-limited VF than at
larger VFs. For such
cases, it instead considers larger vectorization factors, limited by
estimated register usage. Loops like the above can be sped up by ~4x
on AArch64.
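As a rough illustration of how the candidate range widens, the sketch
below derives the maximum VF from a vector register width and an element
width. It is not the code in this patch: candidateVFs and its parameters
are hypothetical names, the "narrow memory ops are expensive" decision is
passed in as a plain flag, and the register-usage limit mentioned above is
left out for brevity.

```cpp
#include <vector>

// Hypothetical helper, not part of LLVM. Returns the power-of-two VFs that
// would be considered for a loop, given the vector register width and the
// widest and narrowest element types used by the loop (all in bits).
std::vector<unsigned> candidateVFs(unsigned regBits, unsigned widestBits,
                                   unsigned narrowestBits,
                                   bool narrowMemOpsExpensive) {
  // Default behaviour: the widest type limits the maximum VF.
  unsigned maxVF = regBits / widestBits;

  // Assumed behaviour of the change: when memory ops on the narrowest type
  // are expensive at that VF, let the narrowest type set the limit instead.
  if (narrowMemOpsExpensive)
    maxVF = regBits / narrowestBits;

  std::vector<unsigned> vfs;
  for (unsigned vf = 1; vf <= maxVF; vf *= 2)
    vfs.push_back(vf);
  return vfs;
}
```

With 128-bit vector registers, an i32 widest type and an i8 narrowest
type, the range grows from {1, 2, 4} to {1, 2, 4, 8, 16}.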
This change should not introduce regressions; we only explore more
vectorization factors, but the cost model still picks the most
profitable one.
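The reasoning can be sketched as follows, assuming the cost model compares
candidates by per-lane cost and falls back to scalar code when no vector
factor is cheaper. pickMostProfitableVF and the loopCost callback are
hypothetical names for illustration, not LLVM APIs.

```cpp
#include <functional>
#include <vector>

// Hypothetical helper, not part of LLVM. Picks the VF with the lowest cost
// per lane; VF = 1 (scalar) is the baseline that a vector VF has to beat.
unsigned pickMostProfitableVF(
    const std::vector<unsigned> &candidates,
    const std::function<double(unsigned)> &loopCost) {
  unsigned best = 1;
  double bestPerLane = loopCost(1);
  for (unsigned vf : candidates) {
    double perLane = loopCost(vf) / vf;
    if (perLane < bestPerLane) {
      bestPerLane = perLane;
      best = vf;
    }
  }
  return best;
}
```

Since the widened candidate list is a superset of the old one, the
selected per-lane cost under this scheme can only stay the same or
decrease.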
The impact on SPEC2000 & SPEC2006 is relatively small:
Tests: 31
Same hash: 18 (filtered out)
Remaining: 13
Metric: loop-vectorize.LoopsVectorized

test-suite...T2000/300.twolf/300.twolf.test    18.00   23.00  27.8%
test-suite...T2000/256.bzip2/256.bzip2.test    12.00   14.00  16.7%
test-suite...T2006/401.bzip2/401.bzip2.test    15.00   17.00  13.3%
test-suite...T2006/445.gobmk/445.gobmk.test    25.00   27.00   8.0%
test-suite...0/253.perlbmk/253.perlbmk.test    32.00   34.00   6.2%
test-suite...000/186.crafty/186.crafty.test    19.00   20.00   5.3%
test-suite...0.perlbench/400.perlbench.test    38.00   40.00   5.3%
test-suite...T2006/456.hmmer/456.hmmer.test    63.00   65.00   3.2%
test-suite...6/482.sphinx3/482.sphinx3.test    64.00   66.00   3.1%
test-suite.../CINT2000/176.gcc/176.gcc.test    43.00   44.00   2.3%
test-suite.../CINT2006/403.gcc/403.gcc.test    97.00   98.00   1.0%
test-suite...3.xalancbmk/483.xalancbmk.test   271.00  273.00   0.7%
test-suite...6/464.h264ref/464.h264ref.test    79.00   79.00   0.0%
There are a few small runtime improvements.
I also verified the changes to the vectorized loops in 300.twolf, 401.bzip2
& 445.gobmk. All changed loops are loops that the patch targets.
nit: Are you planning to handle more special cases like this in the future? If so, then it may be worth moving this to its own shouldMaximizeBandwidth function.