This patch adds on an overhead cost for gathers and scatters, which
is a rough estimate based on performance investigations I have
performed on SVE hardware for various micro-benchmarks.
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
10 sounds high, but from looking at some of the software optimization guides it does not seem like a bad worst-case value.
Is it worth making it an option (that can default to 10), to allow experimentation with other values?
I guess that might be useful?
In the benchmarks I ran (tight loops containing a single gather or scatter, across a variety of strides), the vector loops using SVE gathers and scatters overall didn't perform better than a scalar loop. Even worse, in loops with a high density of gathers and scatters the runtime can be 2x-3x worse than a scalar loop. The optimisation guides do suggest the throughput is very low.
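For illustration, here is a minimal standalone sketch of how a flat overhead like the one discussed could be folded into a gather/scatter cost. This is not the actual patch code: the real logic lives in AArch64TargetTransformInfo.cpp, and the function name, the variable name, and the exact way the overhead combines with the per-element cost are assumptions here.

```cpp
// Hypothetical sketch of a gather/scatter cost with a flat overhead.
// Default value mirrors the "10" discussed above; in LLVM this could be
// exposed as a cl::opt so it can be overridden for experiments.
static unsigned GatherScatterOverhead = 10;

// Cost of one gather/scatter: per-element memory-op cost scaled by the
// number of vector elements, plus the flat overhead reflecting the low
// throughput observed on SVE hardware.
unsigned getGatherScatterCost(unsigned NumElements, unsigned MemOpCost) {
  return NumElements * MemOpCost + GatherScatterOverhead;
}
```

With an overhead this large relative to a scalar load/store, the vectorizer is much less likely to pick a gather-heavy vector loop over the scalar loop, which matches the benchmark observations above.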
There's probably more follow-up work to fine-tune this, but the patch seems like a slam-dunk win overall, as it significantly reduces the regressions we found when running this on a series of benchmarks.
I like @dmgreen's suggestion to make this a tuneable option because it makes it easier to experiment with other g/s costs.
LGTM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
Line 1786: nit: this one-line function can be inlined.
Yeah, sounds OK to me. Like I said in another commit (I think), we may find in the long run that we can optimize the codegen to be better, and certain cpus will be better or worse than others, but this sounds OK to me as a starting point.