This is an archive of the discontinued LLVM Phabricator instance.

[PowerPC] Improve vectorization of loops that operate on values that are extended in the body
Changes Planned · Public

Authored by nemanjai on Nov 8 2019, 10:14 AM.

Details

Reviewers
hfinkel
Group Reviewers
Restricted Project
Summary

When vectorizing loops that operate on values that start out narrow and are extended in the loop body, we don't maximize vector throughput, so the resulting vector code is poor.
Example:

double test(float *__restrict thing1, float *__restrict thing2) {
  int i = 0;
  double aggr_prod = 0.0;

  for (i = 0; i < 300; i++) {
    aggr_prod += (thing1[i] * thing2[i]);
  }

  return aggr_prod;
}

We currently vectorize this by a factor of only 2, extending the values early and performing FMAs for the computation. However, it is much faster to:

  • Vectorize by a factor of 4
  • Perform the multiplication in single precision
  • Extend the result of the multiplication and do the addition
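The three steps above can be sketched in plain C as a manually unrolled loop. This is only an illustration of the intended shape of the computation, not the code the patch generates: the function names `dot_scalar` and `dot_widen_by4` are hypothetical, the four "lanes" are modelled with small arrays rather than real vector registers, and the remainder loop is added so the sketch works for any length. Splitting the sum into four partial accumulators changes the order of the additions, so in general the result can differ in the last bits from the scalar loop.

```c
#include <stddef.h>

/* Scalar reference: multiply in single precision, widen, then accumulate. */
double dot_scalar(const float *a, const float *b, size_t n) {
  double sum = 0.0;
  for (size_t i = 0; i < n; i++) {
    float prod = a[i] * b[i]; /* single-precision multiply */
    sum += (double)prod;      /* extend, then add in double precision */
  }
  return sum;
}

/* Hypothetical sketch of the factor-of-4 strategy (names are assumptions). */
double dot_widen_by4(const float *a, const float *b, size_t n) {
  double acc[4] = {0.0, 0.0, 0.0, 0.0}; /* four double partial sums */
  size_t i = 0;

  for (; i + 4 <= n; i += 4) {
    float prod[4];
    /* Step 1 and 2: four single-precision multiplies per iteration. */
    for (int lane = 0; lane < 4; lane++)
      prod[lane] = a[i + lane] * b[i + lane];
    /* Step 3: extend each product and add it in double precision. */
    for (int lane = 0; lane < 4; lane++)
      acc[lane] += (double)prod[lane];
  }

  /* Combine the partial sums, then handle any leftover elements. */
  double sum = (acc[0] + acc[1]) + (acc[2] + acc[3]);
  for (; i < n; i++)
    sum += (double)(a[i] * b[i]);
  return sum;
}
```

For the 300-element loop in the example, the main loop runs 75 times and the remainder loop is not entered.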

This patch improves the performance of an important kernel by 50%, which in turn provides a very significant improvement on the benchmark that contains the kernel. It has no detrimental effect on the performance of other benchmarks, as measured by SPEC results.

Diff Detail

Repository
rL LLVM

Event Timeline

nemanjai created this revision. Nov 8 2019, 10:14 AM
Herald added a project: Restricted Project. Nov 8 2019, 10:14 AM
nemanjai planned changes to this revision. Nov 19 2019, 5:20 AM

I need to add some additional pieces to this.