This is an archive of the discontinued LLVM Phabricator instance.

[Analysis][AArch64] Make fixed-width ordered reductions slightly more expensive
Closed, Public

Authored by david-arm on Aug 18 2021, 5:00 AM.

Details

Summary

For tight loops like this:

float r = 0;
for (int i = 0; i < n; i++) {
  r += a[i];
}

it's better not to vectorise at -O3 using fixed-width ordered reductions
on AArch64 targets. Although the generated code ends up with a comparable
number of instructions to the scalar loop, there may be additional costs
on some CPUs, for example worse scheduling. It therefore makes sense to
deter vectorisation in tight loops like this.
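
To illustrate why an ordered reduction gains little in such a loop, here is a minimal sketch in plain C++. It only models the semantics: the 4-lane width and the function names are assumptions chosen for illustration, and this is not the code the vectoriser actually emits. The point is that the ordered form still performs one dependent floating-point add per element, so only the loads are widened.

```cpp
#include <cstddef>

// Scalar loop from the summary: one add per element, strictly in order.
float sum_scalar(const float *a, std::size_t n) {
  float r = 0.0f;
  for (std::size_t i = 0; i < n; ++i)
    r += a[i];
  return r;
}

// Shape of a fixed-width ordered reduction, assuming a hypothetical
// vectorisation factor of 4. Elements are fetched four at a time, but each
// lane is still added to the accumulator in its original order, so the
// chain of dependent adds is exactly as long as in the scalar loop.
float sum_ordered_vf4(const float *a, std::size_t n) {
  float r = 0.0f;
  std::size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    // Stands in for a single 4-lane vector load.
    const float v[4] = {a[i], a[i + 1], a[i + 2], a[i + 3]};
    // Stands in for an in-order reduction of the four lanes into r.
    for (int lane = 0; lane < 4; ++lane)
      r += v[lane];
  }
  // Scalar tail for the remaining elements.
  for (; i < n; ++i)
    r += a[i];
  return r;
}
```

Because both functions perform the same additions in the same order, they should return the same value for any input; that ordering constraint is what keeps the instruction count comparable to the scalar loop.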

Event Timeline

david-arm created this revision. Aug 18 2021, 5:00 AM
david-arm requested review of this revision. Aug 18 2021, 5:00 AM
Herald added a project: Restricted Project. Aug 18 2021, 5:00 AM
dmgreen accepted this revision. Aug 18 2021, 5:41 AM

Yeah, this sounds sensible to me. We still vectorize when there starts to be a clear advantage of using other vector operations.
Looks good to me. Thanks.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp, lines 2005–2007

I don't know if we need to talk about this in terms of scheduling exactly; that will be very dependent on the CPU used. Perhaps just describe it in terms of "extra overheads on some CPUs".

This revision is now accepted and ready to land. Aug 18 2021, 5:41 AM
david-arm edited the summary of this revision. Aug 18 2021, 8:25 AM
david-arm added inline comments. Aug 18 2021, 9:02 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp, lines 2005–2007

I've updated this comment in the commit!