Page MenuHomePhabricator

[LoopVectorize] Don't interleave scalar ordered reductions for inner loops
ClosedPublic

Authored by david-arm on Jul 23 2021, 3:49 AM.

Details

Summary

Consider the following loop:

void foo(float *dst, float *src, int N) {
  for (int i = 0; i < N; i++) {
    dst[i] = 0.0;
    for (int j = 0; j < N; j++) {
      dst[i] += src[(i * N) + j];
    }
  }
}

When we are not building with -Ofast we may attempt to vectorise the
inner loop using ordered reductions instead. In addition we also try
to select an appropriate interleave count for the inner loop. However,
when choosing a VF=1 the inner loop will be scalar and there is existing
code in selectInterleaveCount that limits the interleave count to 2
for reductions due to concerns about increasing the critical path.
For ordered reductions this problem is even worse due to the additional
data dependency, and so I've added code to simply disable interleaving
for scalar ordered reductions for now.

Test added here:

Transforms/LoopVectorize/AArch64/strict-fadd-vf1.ll

Diff Detail

Event Timeline

david-arm created this revision.Jul 23 2021, 3:49 AM
david-arm requested review of this revision.Jul 23 2021, 3:49 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 23 2021, 3:49 AM
sdesmalen added inline comments.Jul 26 2021, 7:00 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
6478

I don't think I fully understand why disabling interleaving is more profitable than having it enabled when VF=1, but I think you empirically found that having a UF=1 when VF>1 leads to regressions when enabling strict reductions.

This means that with this patch enabling strict reductions by default will no longer lead to regressions, whereas without strict reductions enabled, this loop would not have been vectorized or interleaved in the first place. So this is purely limiting the scope of strict-reductions to avoid regressions.

That approach sounds sensible to me.

llvm/test/Transforms/LoopVectorize/AArch64/strict-fadd-vf1.ll
7

This REQUIRES: asserts ?

16

is this needed for the test?

david-arm added inline comments.Jul 26 2021, 7:07 AM
llvm/test/Transforms/LoopVectorize/AArch64/strict-fadd-vf1.ll
7

Thanks, good spot @sdesmalen!

16

Probably not. It arises from the IR generated for the following C code where dst[i] is being initialised with memset:

void foo(float *dst, float *src, int N) {
  for (int i = 0; i < N; i++) {
    dst[i] = 0.0;
    for (int j = 0; j < N; j++) {
      dst[i] += src[(i * N) + j];
    }
  }
}

Without the memset we're just adding to the initial value for dst[i]

david-arm updated this revision to Diff 362013.Jul 27 2021, 6:50 AM
  • Addressed review comments
david-arm marked 3 inline comments as done.Jul 27 2021, 6:50 AM
sdesmalen accepted this revision.Jul 27 2021, 6:55 AM

LGTM, cheers @david-arm.

This revision is now accepted and ready to land.Jul 27 2021, 6:55 AM