
[AArch64][LoopVectorize] Enable tail-folding by default for SVE
Needs Review · Public

Authored by david-arm on Jul 27 2022, 2:28 AM.

Details

Summary

This patch enables tail-folding by default for all loop types when
the SVE feature is present. However, as a further optimisation, if
the user specifies the neoverse-v1, x2 or a64fx CPU we limit
tail-folding to simple loops only, i.e. those that do not contain
reductions or first-order recurrences.
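
To make the distinction concrete, here is a purely illustrative pair of loops (not taken from the patch or its tests): the first is "simple" and is tail-folded on all SVE-capable CPUs; the second contains a reduction, so for the CPUs listed above it keeps an unpredicated vector body.

// Hypothetical example: a "simple" loop - no reductions, no first-order
// recurrences - which this patch tail-folds on any SVE-capable CPU.
void saxpy(float *X, float *Y, float A, int N) {
  for (int I = 0; I < N; ++I)
    Y[I] += A * X[I];
}

// Hypothetical example: a reduction loop. With -mcpu=neoverse-v1, x2 or
// a64fx this patch leaves it as an unpredicated vector loop instead.
float sum(const float *X, int N) {
  float S = 0.0f;
  for (int I = 0; I < N; ++I)
    S += X[I];
  return S;
}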

At the same time I had to fix getScalarEpilogueLowering in
LoopVectorize.cpp to ignore the TTI hook behaviour if the flag

-epilogue-vectorization-force-VF=X

is used, since that implies we *don't* want to use tail-folding.
Without this change these tests would fail:

Transforms/LoopVectorize/AArch64/sve-epilog-vect-*

New tests have been added here:

Transforms/LoopVectorize/AArch64/sve-tail-folding-option.ll

Diff Detail

Unit Tests: Failed

Time      Test
2,670 ms    x64 debian > SanitizerCommon-asan-x86_64-Linux :: sanitizer_coverage_allowlist_ignorelist.cpp
2,310 ms    x64 debian > SanitizerCommon-lsan-x86_64-Linux :: sanitizer_coverage_allowlist_ignorelist.cpp
2,490 ms    x64 debian > SanitizerCommon-msan-x86_64-Linux :: sanitizer_coverage_allowlist_ignorelist.cpp
60,030 ms   x64 debian > libFuzzer :: fuzzer-leak.test
60,040 ms   x64 debian > libFuzzer :: value-profile-load.test

Event Timeline

david-arm created this revision. Jul 27 2022, 2:28 AM
david-arm requested review of this revision. Jul 27 2022, 2:28 AM
Herald added a project: Restricted Project. Jul 27 2022, 2:28 AM

Is the restriction the vector length in disguise, or is there something really special, i.e., does the N2 have short SVE vectors?

Matt added a subscriber: Matt. Jul 27 2022, 1:34 PM

Is the restriction the vector length in disguise, or is there something really special, i.e., does the N2 have short SVE vectors?

By 'restriction' are you referring to how this patch limits the types of loop we consider for tail-folding on the N1? This decision is based purely on the observed results from running benchmarks on hardware.

sdesmalen added inline comments. Jul 28 2022, 8:02 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
71–72

This class seems to be mixing two concepts:

  • Features that the loop requires for tail folding (i.e. recurrences/reductions/anything-else).
  • What the target wants as the default and the parsing-logic for user-visible toggles.

This makes it quite tricky to follow the logic, especially the need for both NeedsDefault + DefaultBits and AddBits/RemoveBits.

Perhaps it would make more sense to have two classes:

class TailFoldingKind;    // This encodes which features the loop requires
class TailFoldingOption;  // This encodes the default (which itself will be a TailFoldingKind object), and has logic to parse strings.

Then you can add a method such as TailFoldingKind TailFoldingOption::getSupportedMode() const { .. } that you can use to query if the target's mode satisfies the required TailFoldingKind from the Loop.
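
As a rough sketch of how that split might look (assuming a plain uint8_t bitfield underneath; the enum values and member names here are illustrative guesses, not the patch's actual definitions):

#include <cstdint>

enum TailFoldingOpts : uint8_t {
  TFDisabled    = 0x00,
  TFReductions  = 0x01, // the loop contains reductions
  TFRecurrences = 0x02, // the loop contains first-order recurrences
  TFSimple      = 0x80, // the loop contains neither
  TFAll         = TFReductions | TFRecurrences | TFSimple
};

// Encodes which features a given loop requires for tail-folding.
class TailFoldingKind {
public:
  uint8_t Bits = 0;
  TailFoldingKind() = default;
  TailFoldingKind(uint8_t B) : Bits(B) {}
  void add(uint8_t Flags) { Bits |= Flags; }
};

// Encodes the target's default (itself a TailFoldingKind) and owns the
// parsing of the user-visible -sve-tail-folding string.
class TailFoldingOption {
  TailFoldingKind Default;
public:
  void setDefault(TailFoldingKind K) { Default = K; }
  TailFoldingKind getSupportedMode() const { return Default; }
};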

112–116

Perhaps I'm missing something, but this logic with AddBits and RemoveBits seems unnecessarily complicated. Can't you have a single bitfield and do something like this:

void set(uint8_t Flags, bool Enable) {
  if (Enable)
    Bits |= Flags;
  else
    Bits &= ~Flags;
}

?

3077–3078

This seems redundant given the line above that says 'Defaults to 0'.

How about creating a constructor for it that takes a TailFoldingOpts as an operand, so that you can write TailFoldingKind Required(TFDisabled)?

3085

Is it worth creating a method for this, so that you don't have to expose the bit-field to users, e.g.

return TailFoldingKindLoc.satisfies(Required);

?
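
Something like the following minimal sketch, assuming TailFoldingKind wraps a single Bits field:

// Minimal sketch, assuming a Bits member on both objects: "satisfies" checks
// that every bit required by the loop is also set in this kind.
bool satisfies(TailFoldingKind Required) const {
  return (Bits & Required.Bits) == Required.Bits;
}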

david-arm added inline comments. Jul 28 2022, 8:42 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
112–116

The reason for this is that at the time of parsing the string we have no idea what the default options are going to be. I basically wanted to avoid creating a dependency on the CPU such that the user has to put the -sve-tail-folding option after the -mcpu flag. In the tests I added RUN lines for both "-mcpu=neoverse-v1 -sve-tail-folding=default" and "-sve-tail-folding=default -mcpu=neoverse-v1". In the latter case we can't build on top of the default bits because we don't yet have them at the time of parsing. An example I'm thinking of is this:

-sve-tail-folding=default+nosimple -mcpu=neoverse-v1

which is a bit daft (and we don't even have a nosimple option yet!). We only know the default (simple only for neoverse-v1) once we've parsed the -mcpu flag, and therefore we can't remove the simple flag until later. So I tried doing this by keeping a record of the bits we want to add and remove, and applying them later. That's why a single 'Bits' field like you mentioned above doesn't work. I can have a look at your suggestion above and see if that solves the problem.
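
To make the ordering issue concrete, here is a rough sketch (not the patch's actual code, and the function name is hypothetical) of how recorded add/remove bits could be replayed once the CPU default is finally known:

// Illustrative only: the -sve-tail-folding string may be parsed before -mcpu
// is seen, so we can only remember what to add/remove relative to "default"
// and apply it once the subtarget tells us what DefaultBits actually are.
uint8_t resolveTailFoldingBits(bool NeedsDefault, uint8_t DefaultBits,
                               uint8_t AddBits, uint8_t RemoveBits) {
  uint8_t Bits = NeedsDefault ? DefaultBits : 0; // start from the CPU default if requested
  Bits |= AddBits;      // bits explicitly enabled, e.g. "+reductions"
  Bits &= ~RemoveBits;  // bits explicitly disabled, e.g. "+noreductions"
  return Bits;
}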

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
71–72

This doesn't quite work because you've lost the position of the default value relative to the explicitly enabled/disabled options. It's almost as if you want to defer parsing until the data is required and then have something like TailFoldingKindLoc.parseWithDefault(getTailFoldingDefaultForCPU()). I think if you can do this, many of the other changes in the patch become unnecessary.

david-arm updated this revision to Diff 449273. Aug 2 2022, 6:13 AM
  • Renamed TailFoldingKind -> TailFoldingOption and introduced a simple TailFoldingKind class.
  • TailFoldingOption now stashes a copy of the option string and parses it on demand to ensure that bits are added and removed in the correct order.
  • Added a new satisfies(TailFoldingKind Required) interface to TailFoldingOption and simplified the logic in getScalarEpilogueLowering.
david-arm marked 4 inline comments as done. Aug 2 2022, 6:15 AM
david-arm added inline comments.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
71–72

I've tried to separate out the two concepts into two different classes, but I've kept the DefaultBits as part of the new TailFoldingOption class.

david-arm updated this revision to Diff 449298. Aug 2 2022, 8:29 AM
david-arm marked an inline comment as done.
  • Let cortex-a710 and x2 have the same SVE tail-folding defaults.
dmgreen added a subscriber: dmgreen. Aug 3 2022, 4:36 AM

Can you explain more about why reductions are a problem for certain CPUs? What about the cortex-a510? And if it is being disabled for all these CPUs, should it be disabled for -mcpu=generic too? I'm not sure why we would disable SVE reductions though - first-order recurrences make more sense, but that might be something that is better done in general, not per-subtarget.

And is tail-folding expected to be beneficial in general? As far as I can see it might currently be losing the interleaving, which can be important for performance. And it should ideally not be altering the NEON codegen, if that could be preferable. Is this currently one option that alters both scalable and fixed width vectorization?

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
122

Is this setting a global variable? I think it should just be a field in the subtarget (maybe a subtarget feature) that is potentially overridden by the option if it is present.


Hi @dmgreen, through benchmarking and performance analysis we discovered that on some cores (neoverse-v1, x2, a64fx), if you use tail-folding with reductions the performance is significantly worse (i.e. 80-100% slower on some loops!) than using unpredicated vector loops. This was observed consistently across a range of loops, benchmarks and CPUs, although we don't know exactly why. Our best guess is that it's to do with the chain of loop-carried dependencies in the loops, i.e. reduction PHI + scalar IV + loop predicate PHI. So it's absolutely critical that we avoid significant regressions for benchmarks that contain reductions or first-order recurrences, and this patch is a compromise of sorts. If you don't specify the CPU then we will follow architectural intent and always tail-fold all loops, but when targeting CPUs with this issue we disable tail-folding for such loops.

In general, tail-folding is beneficial for reducing code size and mopping up the scalar tail, as well as following the intentions of the architecture. For example, x264 in SPEC2k17 sees 6-7% performance improvements on neoverse-v1 CPUs due to the low trip counts in hot loops.

With regard to interleaving, the fundamental problem lies with how we do tail-folding in the loop vectoriser, which forces us to decide whether or not to use tail-folding before we've calculated any loop costs. Enabling tail-folding has consequences because suddenly your loops become predicated and the costs change accordingly. For example, NEON does not support masked interleaved memory accesses, so enabling tail-folding leads to insane fixed-width VF costs. At the same time, the loop vectoriser does not support vectorising interleaved memory accesses for scalable vectors either, so we end up in a situation where the vectoriser decides not to vectorise at all! Whereas if we don't enable tail-folding we will vectorise using a fixed-width VF and use NEON's ld2/st2/etc. instructions, which is often still faster than a scalar loop. Ultimately, in the long term we would like to change the loop vectoriser to consider a matrix of costs, with vectorisation style on one axis and VF on the other, then choose the best cost in that matrix. But this is a non-trivial piece of work, so in the short term we opted for this temporary solution.
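
As a purely illustrative example (not from the patch), a loop with interleaved accesses such as the one below can be vectorised for NEON with ld2/st2 when left unpredicated, but once tail-folding forces predication neither NEON (no masked interleaved accesses) nor the scalable-vector path (no interleave-group support yet) gives the vectoriser a profitable option:

// Hypothetical example: scaling complex data stored as { re, im, re, im, ... }.
// Unpredicated NEON vectorisation can use ld2/st2 here; with tail-folding the
// masked equivalents do not exist, and scalable interleave groups are not yet
// supported by the vectoriser.
void scaleComplex(float *Data, float ScaleRe, float ScaleIm, int N) {
  for (int I = 0; I < N; ++I) {
    Data[2 * I]     *= ScaleRe; // real parts (even elements)
    Data[2 * I + 1] *= ScaleIm; // imaginary parts (odd elements)
  }
}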


Hello. That sounds like the loops need unrolling - that's common in order to get the best ILP out of cores. Especially for in-order cores, but it is important for out-of-order cores too. Something like the Neoverse-N2 has 2 FP/ASIMD pipes, and in order to keep them busy you need to unroll small loops. The Neoverse-V1 has 4 FP/ASIMD pipes. But that's not limited to reductions; it applies to any small loop as far as I understand. This is the reason that we use a MaxInterleaveFactor of at least 2 for all cores in LLVM, and some set it higher. I don't think that changes with SVE. It still needs to unroll the loop body, preferably with a predicated epilogue.

It is true that sometimes this extra unrolling is unhelpful for loops with small trip counts. But I was hoping that the predicated epilogue would help with that. I thought that was the plan? What happened to making an unrolled/unpredicated body with a predicated remainder? There will be loops that are large enough that we do not unroll. Those probably make sense to predicate, so long as the performance is good (it will depend on the core). For Os too. For MVE it certainly made sense once we had worked through fixing all the performance degradations we could find, but those are very different cores which pay more heavily for code-structure inefficiencies. And they (almost) never need the unrolling.

For AArch64 the interleaving count will be based on the cost of the loop, among other things. Maybe this should be based on the cost of the loop too? Does tail folding really mean we cannot unroll?

Oh - and about -mcpu=generic - it's difficult, but it needs to be the best we can do for as many cores as possible. Especially on Android systems, where the chances of using a correct -mcpu are almost none. What that is I'm not entirely sure. I think maybe an unpredicated loop with a predicated remainder, with the option to change to something else in the future?

Hi @dmgreen, we had exactly the same thought process as you. We have already explored unrolling tail-folded loops with reductions, and even with the best code quality it makes zero difference to performance on these cores. Sadly it doesn't make the loops faster in the slightest - there appears to be a fundamental bottleneck that cannot be overcome. No amount of unrolling (using LLVM or by hand) helps in the case of reductions or first-order recurrences. I have also tried unrolling plus manually rescheduling instructions in different ways, but to no avail. We were also very surprised by this, and wish it were different!


It sounds like you are hitting another bottleneck. Perhaps the amount of predication resources.