Set the maximum VF on AArch64 to 128 divided by the bit width of the smallest type in the loop.
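To make the arithmetic concrete, here is a minimal sketch of that calculation. The helper name `maxVFForSmallestType` is hypothetical and is not an LLVM API; the only assumption is a 128-bit fixed-width register, as stated above.

```cpp
#include <cassert>

// Hypothetical helper, not part of LLVM: the number of lanes that fit in one
// fixed-width vector register when each lane holds the smallest type that
// appears in the loop.
inline unsigned maxVFForSmallestType(unsigned RegisterBits,
                                     unsigned SmallestTypeBits) {
  assert(SmallestTypeBits != 0 && "type width must be non-zero");
  return RegisterBits / SmallestTypeBits;
}
```

For a 128-bit register this gives a maximum VF of 16 when the smallest type is i8 and 4 when it is i32.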
Benchmark results are shown below.
SPEC2017 Benchmark    Improvement (%)
500.perlbench_r       -0.44372
502.gcc_r              0.11339
505.mcf_r             -0.36421
520.omnetpp_r         -0.12037
523.xalancbmk_r       -0.55858
525.x264_r             0.390159
531.deepsjeng_r       -0.02378
541.leela_r           -0.01357
548.exchange2_r       -0.00043
557.xz_r              -0.17387
Overall improvement (%) on an internal benchmark: 0.238949
It's generally best if fixed-length vectorization doesn't start behaving differently just because SVE is available (unless it can be better, of course). If we expect MaximizeVectorBandwidth to be better, but it doesn't work well for scalable vectors, can we just disable widening for the scalable VFs?
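A rough sketch of what that suggestion could look like, assuming widening means sizing the VF by the smallest type in the loop rather than the widest. The `LoopTypes` struct and `pickMaxLanes` function are hypothetical names for illustration, not the actual LoopVectorize code.

```cpp
#include <cassert>

// Hypothetical description of the element types used in a loop.
struct LoopTypes {
  unsigned SmallestBits; // narrowest element type in the loop
  unsigned WidestBits;   // widest element type in the loop
};

// Hypothetical policy, not LLVM's implementation: only fixed-width VFs are
// widened to the smallest type when bandwidth maximization is requested.
// Scalable VFs keep the conservative widest-type limit. For scalable VFs the
// result is the minimum lane count, i.e. the N in "vscale x N".
inline unsigned pickMaxLanes(unsigned MinRegisterBits, LoopTypes T,
                             bool Scalable, bool MaximizeBandwidth) {
  assert(T.SmallestBits != 0 && T.SmallestBits <= T.WidestBits);
  const bool Widen = MaximizeBandwidth && !Scalable;
  return MinRegisterBits / (Widen ? T.SmallestBits : T.WidestBits);
}
```

With a 128-bit minimum register size and a loop mixing i8 and i32, this sketch returns 16 lanes for the fixed-width case and leaves the scalable case at 4 lanes per vscale unit, so enabling the option does not change how scalable VFs are chosen.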