When vectorizing loops that operate on values that start out narrow and are extended inside the loop, we do not maximize vector throughput and overall do a poor job of vectorizing.
Example:
    double test(float *__restrict thing1, float *__restrict thing2) {
      int i = 0;
      double aggr_prod = 0.0;
      for (i = 0; i < 300; i++) {
        aggr_prod += (thing1[i] * thing2[i]);
      }
      return aggr_prod;
    }
We currently vectorize this only by a factor of 2, then extend early and perform FMAs for the computation. However, it is much faster (see the sketch after this list) to:
- Vectorize by a factor of 4
- Perform the multiplication in single precision
- Extend the results of the multiplication and perform the addition in double precision
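For illustration only, the preferred lowering corresponds roughly to the following hand-vectorized sketch using SSE2 intrinsics; the function name test_vec4, the trip-count parameter n, and the SSE2 target are assumptions for the example, not part of the patch, and the reduction is reassociated as a vectorized reduction would be:

    #include <immintrin.h>

    double test_vec4(float *__restrict thing1, float *__restrict thing2, int n) {
      __m128d acc_lo = _mm_setzero_pd();
      __m128d acc_hi = _mm_setzero_pd();
      int i = 0;
      for (; i + 4 <= n; i += 4) {
        __m128 a = _mm_loadu_ps(&thing1[i]);
        __m128 b = _mm_loadu_ps(&thing2[i]);
        /* Multiply all four lanes in single precision. */
        __m128 prod = _mm_mul_ps(a, b);
        /* Extend the products to double and accumulate. */
        __m128d lo = _mm_cvtps_pd(prod);
        __m128d hi = _mm_cvtps_pd(_mm_movehl_ps(prod, prod));
        acc_lo = _mm_add_pd(acc_lo, lo);
        acc_hi = _mm_add_pd(acc_hi, hi);
      }
      /* Horizontal reduction of the two accumulators. */
      double tmp[2];
      _mm_storeu_pd(tmp, _mm_add_pd(acc_lo, acc_hi));
      double aggr_prod = tmp[0] + tmp[1];
      /* Scalar remainder loop. */
      for (; i < n; i++)
        aggr_prod += thing1[i] * thing2[i];
      return aggr_prod;
    }

The key point is that the multiply stays at vector width 4 in single precision, and only the accumulation is widened to double.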
This patch improves the performance of an important kernel by 50%, which in turn provides a very significant improvement on the benchmark that contains the kernel. It also has no detrimental effect on the performance of other benchmarks, as measured by SPEC results.