[SLP] Enable 64-bit wide vectorization on AArch64

ARM Neon has native support for half-sized vector registers (64 bits). This
is beneficial, for example, for 2D and 3D graphics. This patch adds a TTI
hook that lets a target lower MinVecRegSize in the SLP Vectorizer from 128.
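
As a rough illustration of why a smaller MinVecRegSize matters, the sketch
below models the target hook and the minimum lane count derived from it. It is
a simplified stand-in, not the code in this patch: TargetInfoSketch,
AArch64InfoSketch, minVectorRegisterBitWidth and minLanes are hypothetical
names, and the max(2, MinVecRegSize / element width) rule is an assumption
about how the vectorizer bounds its narrowest bundle.

```cpp
#include <algorithm>

// Hypothetical stand-in for the TTI hook; not the real LLVM interface.
struct TargetInfoSketch {
  virtual ~TargetInfoSketch() = default;
  // Default assumed here: vectorize only if a 128-bit register can be filled.
  virtual unsigned minVectorRegisterBitWidth() const { return 128; }
};

// Hypothetical AArch64 override: Neon also has 64-bit vector registers.
struct AArch64InfoSketch : TargetInfoSketch {
  unsigned minVectorRegisterBitWidth() const override { return 64; }
};

// Smallest number of scalar lanes worth bundling for a given element width.
// Assumed rule: at least two lanes, and enough lanes to fill the minimum
// vector register width reported by the target.
unsigned minLanes(const TargetInfoSketch &TTI, unsigned ElemBits) {
  return std::max(2u, TTI.minVectorRegisterBitWidth() / ElemBits);
}
```

Under these assumptions, a pair of 32-bit stores is too narrow for the
128-bit default (four lanes required) but qualifies once the target reports
64 bits (two lanes required).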

  • Performance Analysis

This change was motivated by some internal benchmarks, but it is also
beneficial on SPEC and the LLVM testsuite.

The results were collected with -O3 and PGO. A negative percentage is an
improvement. The testsuite was run with a sample size of 4.

  • SPEC
  • CFP2006/482.sphinx3 -3.34%

A fairly hot loop is SLP-vectorized, resulting in a nice reduction in
instructions. This used to be a +22% regression before rL299482.

  • CFP2000/177.mesa -3.34%
  • CINT2000/256.bzip2 +6.97%

My current plan is to extend the fix in rL299482 to i16, which brings the
regression down to +2.5%. There are also other problems with the codegen in
this loop, so there is further room for improvement.

  • LLVM testsuite
  • SingleSource/Benchmarks/Misc/ReedSolomon -10.75%

There are multiple small SLP vectorizations outside the hot code. It's a bit
surprising that it adds up to 10%. Some of this may be code-layout noise.

  • MultiSource/Benchmarks/VersaBench/beamformer/beamformer -8.40%

The opt-viewer screenshot can be seen at F3218284. We start at a colder
store, but the tree leads us into the hottest loop.

  • MultiSource/Applications/lambda-0.1.3/lambda -2.68%
  • MultiSource/Benchmarks/Bullet/bullet -2.18%

This benchmark uses 3D vectors.

  • SingleSource/Benchmarks/Shootout-C++/Shootout-C++-lists +6.67%

This is noise; the binary is unchanged.

  • MultiSource/Benchmarks/Ptrdist/anagram/anagram +4.90%

There is an additional SLP vectorization in the cold code. The test runs for
~1 second and prints out over 2000 lines. This is most likely noise.

  • MultiSource/Applications/aha/aha +1.63%
  • MultiSource/Applications/JM/lencod/lencod +1.41%
  • SingleSource/Benchmarks/Misc/richards_benchmark +1.15%

Differential Revision: https://reviews.llvm.org/D31965


