Set the maximum vectorization factor (VF) for AArch64 to 128 / the size in bits of the smallest type in the loop.
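As a rough illustration of the arithmetic only, the rule above can be sketched as follows. The names kVectorRegisterBits and maxVectorizationFactor are hypothetical and are not taken from the patch or from the LLVM API. The sketch assumes that the type size is given in bits and that it evenly divides 128.

```cpp
#include <cassert>

// Hypothetical constant: the 128 used in the description above.
constexpr unsigned kVectorRegisterBits = 128;

// Hypothetical helper: returns 128 / (size in bits of the smallest type in the loop).
// Assumes smallestTypeBits is non-zero and no larger than 128.
unsigned maxVectorizationFactor(unsigned smallestTypeBits) {
  assert(smallestTypeBits > 0 && smallestTypeBits <= kVectorRegisterBits);
  return kVectorRegisterBits / smallestTypeBits;
}
```

Under these assumptions, a loop whose smallest type is 8 bits wide gets a maximum VF of 16, and a loop whose smallest type is 32 bits wide gets a maximum VF of 4.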
The performance changes measured on benchmarks are shown below.
SPEC2017
Benchmark           Improvement (%)
500.perlbench_r     -0.44372
502.gcc_r            0.11339
505.mcf_r           -0.36421
520.omnetpp_r       -0.12037
523.xalancbmk_r     -0.55858
525.x264_r           0.390159
531.deepsjeng_r     -0.02378
541.leela_r         -0.01357
548.exchange2_r     -0.00043
557.xz_r            -0.17387
Overall improvement (%) on an internal benchmark: 0.238949