This is an archive of the discontinued LLVM Phabricator instance.

[AArch64][Analysis] Add on overhead costs for SVE gathers and scatters
ClosedPublic

Authored by david-arm on Dec 6 2021, 3:34 AM.

Details

Summary

This patch adds on an overhead cost for gathers and scatters, which
is a rough estimate based on performance investigations I have
performed on SVE hardware for various micro-benchmarks.
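
To illustrate what "adding on an overhead" could mean for a cost model, here is a minimal standalone sketch. It assumes the overhead acts as a multiplier on a per-element memory cost, which this review does not spell out. Every name here (`MemOp`, `GatherScatterOverheads`, `estimateGatherScatterCost`) is hypothetical and is not taken from the patch or from LLVM's API.

```cpp
#include <cstdint>

// Hypothetical stand-ins for illustration only; not LLVM types.
enum class MemOp { Gather, Scatter };

// Overhead multipliers, kept separate so gathers and scatters can be
// tuned independently.
struct GatherScatterOverheads {
  unsigned Gather;
  unsigned Scatter;
};

// Assumed model: each legalized part of the vector pays the per-element
// memory cost, scaled by the overhead, once per element it may hold.
std::uint64_t estimateGatherScatterCost(MemOp Op, std::uint64_t NumLegalParts,
                                        std::uint64_t ElementMemCost,
                                        std::uint64_t MaxElementsPerPart,
                                        const GatherScatterOverheads &O) {
  const std::uint64_t Overhead =
      (Op == MemOp::Gather) ? O.Gather : O.Scatter;
  return NumLegalParts * (ElementMemCost * Overhead) * MaxElementsPerPart;
}
```

Under this assumed model, a larger overhead makes a vector loop containing gathers or scatters look proportionally more expensive than its scalar equivalent, so a cost-driven vectorizer would choose it less often.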

Event Timeline

david-arm created this revision. Dec 6 2021, 3:34 AM
david-arm requested review of this revision. Dec 6 2021, 3:34 AM
Herald added a project: Restricted Project. Dec 6 2021, 3:34 AM
david-arm retitled this revision from [AArch64][Analysis] Add on overhead costs for gathers and scatters to [AArch64][Analysis] Add on overhead costs for SVE gathers and scatters.
david-arm updated this revision to Diff 392300. Dec 7 2021, 12:55 AM
  • Tweaked the overhead based on revised micro-benchmark measurements.

10 sounds high, but from looking at some of the software optimization guides it does not seem like a bad worst-case value.

Is it worth making it an option (that can default to 10), to allow experimentation with other values?

I guess that might be useful?

From the benchmarks I ran for tight loops with a single gather or scatter, across a variety of different strides, the vector loops with SVE gathers and scatters overall didn't perform better than a scalar loop! Even worse, in loops with a high density of gathers and scatters the runtime can be 2x-3x worse than a scalar loop. The optimisation guides do suggest that throughput is very low.

There's probably more follow-up work to fine-tune this, but the patch seems like a slam-dunk win overall, as it significantly reduces the regressions we found when running this on a series of benchmarks.
I like @dmgreen's suggestion to make this a tuneable option because it makes it easier to experiment with other gather/scatter costs.

david-arm updated this revision to Diff 393149. Dec 9 2021, 7:00 AM
  • Rebased.
  • Added command line options to change the SVE gather/scatter overheads.
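
As a rough illustration of how such tunable overheads might behave, the standalone sketch below parses two options and falls back to a default of 10, the value discussed above. The flag spellings `-sve-gather-overhead=` and `-sve-scatter-overhead=` are assumptions, since this review does not name the options. The parser is a plain C++ stand-in rather than LLVM's command-line library, and `OverheadOptions` and `parseOverheadOptions` are hypothetical names.

```cpp
#include <string>
#include <vector>

// Hypothetical container for the two tunable overheads; both default
// to 10, the value discussed in this review.
struct OverheadOptions {
  unsigned Gather = 10;
  unsigned Scatter = 10;
};

// Scans "-name=value" style arguments and overrides the defaults.
// The flag spellings are assumptions, not confirmed by the review.
// std::stoul throws if the value is not a number.
OverheadOptions parseOverheadOptions(const std::vector<std::string> &Args) {
  OverheadOptions Opts;
  const std::string GatherFlag = "-sve-gather-overhead=";
  const std::string ScatterFlag = "-sve-scatter-overhead=";
  for (const std::string &Arg : Args) {
    if (Arg.compare(0, GatherFlag.size(), GatherFlag) == 0)
      Opts.Gather =
          static_cast<unsigned>(std::stoul(Arg.substr(GatherFlag.size())));
    else if (Arg.compare(0, ScatterFlag.size(), ScatterFlag) == 0)
      Opts.Scatter =
          static_cast<unsigned>(std::stoul(Arg.substr(ScatterFlag.size())));
  }
  return Opts;
}
```

Keeping the gather and scatter values separate lets each be varied on its own when experimenting, while unrelated arguments leave the defaults untouched.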
sdesmalen accepted this revision. Dec 9 2021, 7:04 AM

LGTM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1786

nit: this one-line function can be inlined.

This revision is now accepted and ready to land. Dec 9 2021, 7:04 AM

Yeah, sounds OK to me. Like I said in another commit (I think), we may find in the long run that we can optimize the codegen to be better, and certain CPUs will be better or worse than others, but this sounds OK to me as a starting point.