Set the maximum vscale VF on AArch64 to 128 divided by the size of the smallest type in the loop, when there is no register-usage overflow. This is similar to the Neon VF change done in D118979.
For Neon we enabled shouldMaximizeVectorBandwidth so that the backend could make use of instructions like umull/umull2 and the narrowing instructions. Extending into larger types is quite natural for Neon in places, and can lead to fewer instructions overall. SVE has instructions like UMULLB/T that work on the top/bottom lanes of a pair, but I don't believe the backend makes any use of them at the moment.
The description is a bit light on details. What is the reasoning behind enabling this for SVE too? And do you have any benchmark results?
I don't have access to a server with SVE to run the performance of a large benchmark like SPEC2017.
But when I run LAMMPS in Intel mode (https://www.lammps.org/#gsc.tab=0) on an emulator, I find the
hot function PairLJCutCoulLongIntel::eval in file pair_lj_cut_coul_long_intel.cpp:337 has its VF enlarged from 2 to 4.
Because there are both float and double types in the kernel loop body, choosing a wider VF gives
wider parallelism, and the performance gain is about 16% (https://github.com/lammps/lammps/blob/develop/src/INTEL/pair_lj_cut_coul_long_intel.cpp#L337).
For the record - In SVE2 there are a number of instructions that can use top/bottom lanes providing the backend does some sort of lane interleaving. Once that is done this might make a lot of sense but it might be better to address that first.