This is an archive of the discontinued LLVM Phabricator instance.

[AArch64][LoopVectorize] Enable tail-folding of simple loops on neoverse-v1
ClosedPublic

Authored by david-arm on Jul 27 2022, 2:28 AM.

Details

Summary

This patch enables tail-folding of simple loops by default when
targeting the neoverse-v1 CPU. Here "simple" excludes loops
containing recurrences or reductions, as well as loops that are
reversed.

New tests have been added here:

Transforms/LoopVectorize/AArch64/sve-tail-folding-option.ll

In terms of SPEC2017 only one benchmark is really affected when
building with "-Ofast -mcpu=neoverse-v1 -flto", which is
(+ faster, - slower):

525.x264: +7.0%

Diff Detail

Event Timeline

david-arm created this revision.Jul 27 2022, 2:28 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 27 2022, 2:28 AM
david-arm requested review of this revision.Jul 27 2022, 2:28 AM

Is the restriction the vector length in disguise or is something really special, i.e., the N2 has short SVE vectors?

Matt added a subscriber: Matt.Jul 27 2022, 1:34 PM

By 'restriction' are you referring to how this patch limits the types of loop we consider for tail-folding on the N1? This decision is based purely on the observed results from running benchmarks on hardware.

sdesmalen added inline comments.Jul 28 2022, 8:02 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
73–74

This class seems to be mixing two concepts:

  • Features that the loop requires for tail folding (i.e. recurrences/reductions/anything-else).
  • What the target wants as the default and the parsing-logic for user-visible toggles.

This makes it quite tricky to follow the logic, especially the need for both NeedsDefault + DefaultBits and Add/RemoveBits.

Perhaps it would make more sense to have two classes:

class TailFoldingKind;    // This encodes which features the loop requires
class TailFoldingOption;  // This encodes the default (which itself will be a TailFoldingKind object), and has logic to parse strings.

Then you can add a method such as TailFoldingKind TailFoldingOption::getSupportedMode() const { .. } that you can use to query if the target's mode satisfies the required TailFoldingKind from the Loop.
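
A rough sketch of the suggested split (illustrative only; the names and
members here are hypothetical, not the code as ultimately committed):

#include <cstdint>

class TailFoldingKind {
  uint8_t Bits = 0; // TFDisabled
public:
  TailFoldingKind() = default;
  explicit TailFoldingKind(uint8_t B) : Bits(B) {}
  void set(uint8_t Flags) { Bits |= Flags; }
  // True if this mode enables every feature that 'Required' asks for.
  bool satisfies(TailFoldingKind Required) const {
    return (Bits & Required.Bits) == Required.Bits;
  }
};

class TailFoldingOption {
  TailFoldingKind DefaultKind; // what the target wants by default
public:
  void setDefault(TailFoldingKind K) { DefaultKind = K; }
  // In this sketch just the default; the real version would combine it
  // with any user-specified toggles from the -sve-tail-folding string.
  TailFoldingKind getSupportedMode() const { return DefaultKind; }
};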

120–125

Perhaps I'm missing something, but this logic with AddBits and RemoveBits seems unnecessarily complicated. Can't you have a single bitfield and do something like this:

void set(uint8_t Flags, bool Enable) {
  if (Enable)
    Bits |= Flags;
  else
    Bits &= ~Flags;
}

?

3495–3496

This seems redundant given the line above that says 'Defaults to 0'.

How about creating a constructor for it that takes a TailFoldingOpts as an operand, so that you can write TailFoldingKind Required(TFDisabled)?

3510

Is it worth creating a method for this, so that you don't have to expose the bit-field to users, e.g.

return TailFoldingKindLoc.satisfies(Required);

?

david-arm added inline comments.Jul 28 2022, 8:42 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
120–125

The reason for this is that at the time of parsing the string we have no idea what the default options are going to be. I basically wanted to avoid creating a dependency on the CPU flag, such that the user would have to put the -sve-tail-folding option after the -mcpu flag. In the tests I added two RUN lines covering both "-mcpu=neoverse-v1 -sve-tail-folding=default" and "-sve-tail-folding=default -mcpu=neoverse-v1". In the latter case we can't build on top of the default bits because we don't yet have them at the time of parsing. An example I'm thinking of is this:

-sve-tail-folding=default+nosimple -mcpu=neoverse-v1

which is a bit daft (and we don't even have a nosimple option yet!). We only know the default (simple only for neoverse-v1) once we've parsed the -mcpu flag, and therefore we can't remove the simple flag until later. So I tried doing this by keeping a record of the bits we want to add and remove, and applying them later. That's why a single 'Bits' field like you mentioned above doesn't work. I can have a look at your suggestion above and see if that solves the problem.
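
To make that concrete, a minimal sketch of the record-and-apply scheme
described above (member names illustrative):

uint8_t getBits(uint8_t DefaultBits) const {
  // Start from the target's defaults only if the user asked for "default".
  uint8_t Bits = NeedsDefault ? DefaultBits : InitialBits;
  Bits |= AddBits;     // flags explicitly enabled, e.g. "+reductions"
  Bits &= ~RemoveBits; // flags explicitly disabled, e.g. "+noreductions"
  return Bits;
}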

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
73–74

This doesn't quite work because you've lost the position of the default value relative to the explicitly enabled/disabled options. It's almost as if you want to defer parsing until the data is required and then have something like TailFoldingKindLoc.parseWithDefault(getTailFoldingDefaultForCPU()). I think if you can do this, many of the other changes in the patch become unnecessary.

david-arm updated this revision to Diff 449273.Aug 2 2022, 6:13 AM
  • Renamed TailFoldingKind -> TailFoldingOption and introduced a simple TailFoldingKind class.
  • TailFoldingOption now stashes a copy of the option string and parses it on demand to ensure that bits are added and removed in the correct order.
  • Added a new satisfies(TailFoldingKind Required) interface to TailFoldingOption and simplified the logic in getScalarEpilogueLowering
david-arm marked 4 inline comments as done.Aug 2 2022, 6:15 AM
david-arm added inline comments.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
73–74

I've tried to separate out the two concepts into two different classes, but I've kept the DefaultBits as part of the new TailFoldingOption class.

david-arm updated this revision to Diff 449298.Aug 2 2022, 8:29 AM
david-arm marked an inline comment as done.
  • Let cortex-a710 and x2 have the same SVE tail-folding defaults.
dmgreen added a subscriber: dmgreen.Aug 3 2022, 4:36 AM

Can you explain more about why reductions are a problem for certain cpus? What about the cortex-a510? And if it is being disabled for all these cpus, should it be disabled for -mcpu=generic too? I'm not sure why we would disable sve reductions though - first-order recurrences make more sense, but that might be something that is better done in general, not per-subtarget.

And is tail-folding expected to be beneficial in general? As far as I can see it might currently be losing the interleaving, which can be important for performance. And it should ideally not be altering the NEON codegen, if that could be preferable. Is this currently one option that alters both scalable and fixed-width vectorization?

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
131

Is this setting a global variable? I think it should be just a field in the subtarget (maybe a subtarget feature), that is potentially overridden by the option if it is present.

Hi @dmgreen, through benchmarking and performance analysis we discovered that on some cores (neoverse-v1, x2, a64fx) if you use tail-folding with reductions the performance is significantly worse (i.e. 80-100% slower on some loops!) than using unpredicated vector loops. This was observed consistently across a range of loops, benchmarks and CPUs, although we don't know exactly why. Our best guess is that it's to do with the chain of loop-carried dependencies in the loops, i.e. reduction PHI + scalar IV + loop predicate PHI. So it's absolutely critical that we avoid significant regressions for benchmarks that contain reductions or first-order recurrences, and this patch is a sort of compromise. If you don't specify the CPU then we will follow architectural intent and always tail-fold for all loops, but when targeting CPUs with this issue we disable tail-folding for such loops.
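
For concreteness, an illustrative reduction loop of the kind described
(hypothetical example, not taken from the benchmarks mentioned):

float dot(const float *a, const float *b, int n) {
  float s = 0.0f;             // reduction PHI
  for (int i = 0; i < n; i++) // scalar IV, plus a predicate PHI when tail-folded
    s += a[i] * b[i];
  return s;
}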

In general, tail-folding is beneficial for reducing code size and mopping up the scalar tail, as well as following the intentions of the architecture. For example, x264 in SPEC2k17 sees 6-7% performance improvements on neoverse-v1 CPUs due to the low trip counts in hot loops.

With regards to interleaving, the fundamental problem lies with how we do tail-folding in the loop vectoriser, which forces us to make cost-based decisions about whether to use tail-folding or not before we've calculated any loop costs. Enabling tail-folding has consequences because suddenly your loops become predicated and the costs change accordingly. For example, NEON does not support masked interleaved memory accesses, so enabling tail-folding leads to insane fixed-width VF costs. At the same time the loop vectoriser does not support vectorising interleaved memory accesses for scalable vectors either, so we end up in a situation where the vectoriser decides not to vectorise at all! Whereas if we don't enable tail-folding we will vectorise using a fixed-width VF and use NEON's ld2/st2/etc instructions, which is often still faster than a scalar loop. Ultimately in the long term we would like to change the loop vectoriser to consider a matrix of costs, with vectorisation style on one axis and VF on the other, then choose the most optimal cost in that matrix. But this is a non-trivial piece of work, so in the short term we opted for this temporary solution.
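
As an illustration of the interleaving point, a hypothetical stride-2 loop
like the one below can be vectorised for NEON with ld2/st2, but NEON has no
masked form of those instructions, and the vectoriser cannot create scalable
interleaved accesses, so forcing tail-folding on such a loop would leave it
scalar:

void scale_complex(float *v, float s, int n) {
  // v holds n interleaved (real, imag) pairs.
  for (int i = 0; i < n; i++) {
    v[2 * i] *= s;     // real part
    v[2 * i + 1] *= s; // imaginary part
  }
}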

Hello. That sounds like the loops need unrolling - that's common in order to get the best ILP out of cores. Especially for in-order cores, but it is important for out of order cores too. Something like the Neoverse-N2 has 2 FP/ASIMD pipes, and in order to keep them busy you need to unroll small loops. The Neoverse-V1 has 4 FP/ASIMD pipes. But that's not limited to reductions, it applies to any small loop as far as I understand. This is the reason that we use a MaxInterleaveFactor of at least 2 for all cores in llvm, and some set it higher. I don't think that changes with SVE. It still needs to unroll the loop body, preferably with a predicated epilogue.

It is true that sometimes this extra unrolling is unhelpful for loops with small trip counts. But I was hoping that the predicated epilogue would help with that. I thought that was the plan? What happened to making an unrolled/unpredicated body with a predicated remainder? There will be loops that are large enough that we do not unroll. Those probably make sense to predicate, so long as the performance is good (it will depend on the core). For Os too. For MVE it certainly made sense once we had worked through fixing all the performance degradations we could find, but those are very different cores which pay more heavily for code-structure inefficiencies. And they (almost) never need the unrolling.
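
For reference, a hand-written ACLE intrinsics sketch of the two styles being
contrasted (illustrative only, not compiler output):

#include <arm_sve.h>

// Fully tail-folded: a single loop in which every iteration is predicated
// by a WHILELT-generated mask, so no scalar tail is needed.
void negate_tail_folded(float *x, const float *y, int n) {
  for (int i = 0; i < n; i += (int)svcntw()) {
    svbool_t pg = svwhilelt_b32(i, n);
    svst1_f32(pg, x + i, svneg_f32_x(pg, svld1_f32(pg, y + i)));
  }
}

// Unpredicated main body (which could additionally be unrolled) plus a
// single predicated iteration that mops up the remainder.
void negate_body_plus_epilogue(float *x, const float *y, int n) {
  int step = (int)svcntw();
  svbool_t all = svptrue_b32();
  int i = 0;
  for (; i + step <= n; i += step) // hot path: no WHILELT dependency
    svst1_f32(all, x + i, svneg_f32_x(all, svld1_f32(all, y + i)));
  svbool_t pg = svwhilelt_b32(i, n); // predicated epilogue for the tail
  svst1_f32(pg, x + i, svneg_f32_x(pg, svld1_f32(pg, y + i)));
}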

For AArch64 the interleaving count will be based on the cost of the loop, among other things. Maybe this should be based on the cost of the loop too? Does tail folding really mean we cannot unroll?

Oh - and about -mcpu=generic - it's difficult, but it needs to be the best we can do for the most cores possible. Especially on Android systems where the chances of using a correct -mcpu are almost none. What that is I'm not entirely sure. I think maybe an unpredicated loop with a predicated remainder, with the option to change to something else in the future?

Hi @dmgreen, we had exactly the same thought process as you. We have already explored unrolling tail-folded loops with reductions, and even with the best code quality it makes zero difference to performance on these cores. Sadly it doesn't make the loops faster in the slightest - there appears to be a fundamental bottleneck that cannot be surpassed. No amount of unrolling (using LLVM or by hand) helps in the case of reductions or first-order recurrences. I have also tried unrolling plus manually rescheduling instructions in different ways, but to no avail. We were also very surprised by this, and wish it were different!

It sounds like you are hitting another bottleneck. Perhaps the amount of predication resources.

Hi @SjoerdMeijer, thanks for looking into this. I do actually already have a patch to enable this by default (https://reviews.llvm.org/D130618), where the default behaviour is tuned according to the CPU. I think this is what we want because the profile will change according to what CPU you're running on - some CPUs may handle reductions better than others.

I didn't know what the problem was with reductions. I now see you had a discussion with Dave here about reductions, but it looks like it is still unclear why exactly that is a problem (at least to me).
But I thought that "simple" would be a good start as a default, which is what you're more or less doing here too. Well, you're setting it for a few cpus, but my feeling is that "simple" should be a very safe bet for all.

The decision in this patch may be incorrect for 128-bit vector implementations. I also ran SPEC2k17 on an SVE-enabled CPU and I remember seeing a small (2-3%) regression in parest or something like that, which is one of the reasons I didn't push the patch any further. I also think it's really important to run a much larger set of benchmarks besides SPEC2k17 and collect numbers to show the benefits, since there isn't much vectorisation actually going on in SPEC2k17.

I will collect data for SPEC FP. I don't know exactly how much vectorisation is going on there, but will see. By the way, most vectorisation happens in x264, which sees a significant uplift. My numbers on a 256-bit vector implementation show neutral results for the other SPEC INT apps. I think this is a very good start.

One of the major problems with the current tail-folding implementation is that we make the decision before doing any cost analysis in the vectoriser, which isn't great because we may be forcing the vectoriser to take different code paths than if we didn't tail-fold. Ideally what we really want is to move to a model where the vectoriser has a two-dimensional matrix of costs considering the combination of VF and vectorisation style (e.g. tail-folding vs whole vector loops, etc.), and choose the most optimal combination.
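
A toy sketch of that idea (the types and names are entirely hypothetical,
not an existing LLVM API):

#include <algorithm>
#include <vector>

enum class VecStyle { WholeVectorLoop, TailFolded };

struct Plan {
  VecStyle Style; // one axis: vectorisation style
  unsigned VF;    // other axis: vectorisation factor
  unsigned Cost;  // estimated cost of this (style, VF) combination
};

// Choose the cheapest entry in the two-dimensional cost matrix.
Plan pickBestPlan(const std::vector<Plan> &Candidates) {
  return *std::min_element(
      Candidates.begin(), Candidates.end(),
      [](const Plan &A, const Plan &B) { return A.Cost < B.Cost; });
}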

Yeah, I am aware and noticed this while implementing tail-folding for MVE. This is a problem, but something we have to live with for a while I think, as it is not very low-hanging fruit to change this. That's my impression at least. However, with the "simple" tail-folding it is difficult to see how it would lead to regressions.

What are your ideas to progress this?

I saw a small (2-3%) regression in parest or something like that

Yep, I see a 2.7% regression in parest.

david-arm updated this revision to Diff 516418.Apr 24 2023, 8:32 AM
david-arm retitled this revision from [AArch64][LoopVectorize] Enable tail-folding by default for SVE to [AArch64][LoopVectorize] Enable tail-folding of simple loops on neoverse-v1.
david-arm edited the summary of this revision. (Show Details)
david-arm edited reviewers, added: dmgreen; removed: peterwaller-arm.
  • Rebase.
  • The patch now only enables tail-folding by default for neoverse-v1.

Hi @david-arm, I wonder if it makes sense to move some of these changes to a separate (NFC?) patch where you do the refactoring of the class, so that this patch becomes as simple as calling setSVETailFoldingDefaultOpts(TFSimple);. That way, if for any reason the patch needs to be reverted, we don't lose out on the refactoring. We can still test the current code with the llc options.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
43

Does this class still have any value? It only does bitwise |, & and ~, which you can inline below.

60

nit: Could you rename this to UnparsedOptionString or something?

60–61

Rather than storing the string and substrings, can you just do the parsing as part of the constructor of TailFoldingOption so that you only need to store the bits?

90

Instead of an llvm_unreachable, do you want to print the error message here?

112

This is only printing to stderr. Should this be a fatal error?

llvm/lib/Target/AArch64/Utils/AArch64BaseInfo.h
537

I know you're just moving this code around, but could you add a description of what 'simple' means? And maybe also add a description for TFRecurrences (are those first-order recurrences that require a splice operation?)

(very minor nit: would it make sense to make this enum value 0x01 so that it's clear that anything not '1' is no longer a 'simple' loop?)
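
For readers without the diff, the flag layout under discussion looks roughly
like this (a sketch; the exact names and values are in the committed code,
and TFSimple = 0x01 reflects the nit above):

#include <cstdint>

enum TailFoldingOpts : uint8_t {
  TFDisabled    = 0x00,
  TFSimple      = 0x01, // loops with none of the features below
  TFReductions  = 0x02, // loops containing reductions
  TFRecurrences = 0x04, // loops with first-order recurrences
  TFReverse     = 0x08, // reversed loops
  TFAll = TFSimple | TFReductions | TFRecurrences | TFReverse
};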

david-arm added inline comments.Apr 25 2023, 1:28 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
60–61

I seem to remember there was a problem with ordering where we had to ensure we got the same behaviour for each version of this command line:

-mcpu=native -mllvm -sve-tail-folding=default

and

-mllvm -sve-tail-folding=default -mcpu=native

which is tested in sve-tail-folding-option.ll. I think it meant that we could only deal with the options after the entire command line has been parsed, so we couldn't do it in the constructor. I'll double check though as I may have remembered incorrectly!

Do you intend this to be a real patch to review? It doesn't match my understanding of the theory, but perhaps something has changed?

The short answer is - yes. As the patch says we want to enable tail-folding by default for simple loops on neoverse-v1 CPUs.

OK. My understanding was that the Neoverse-V1 was one of the worst cpus for tail folding due to the limited throughput of while instructions. Doesn't tail folding prevent interleaving too? Newer generations should do better than the V1.

This is one of the "simplest" loops I can think of. My understanding is that it will go at around half the speed with this patch: https://godbolt.org/z/K4e3PMdar

Are you sure this shouldn't be based on whether the loop is small enough to desire interleaving? As in, the difference between Simple and Reductions was never really about reductions; those were just the loops that hit the problem the hardest. (There may be some other issues with reductions codegen, but the limited interleaving/throughput still remains). The real problem is that we need to make sure we don't limit the interleaving for small loops, and that the while instructions don't become a bottleneck for throughput.

Allen added a subscriber: Allen.Apr 27 2023, 12:37 AM
david-arm updated this revision to Diff 518757.May 2 2023, 8:43 AM
  • Addressed review comments, split off some of the refactoring into an NFC patch (D149659).
  • Restricted tail-folding only to loops that exceed a certain instruction threshold, so that we get the benefit of interleaving for tight loops.
david-arm marked 5 inline comments as done.May 2 2023, 8:54 AM
david-arm added inline comments.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
60–61

Hi @sdesmalen, so this isn't possible because (as mentioned above) you can't guarantee that the construction (in this case the operator=) of the option happens before setting the CPU with -mcpu=neoverse-v1. So you have to construct the bits only after you're guaranteed to have parsed *all* options on the command line. This is why we have to dynamically reconstruct the bits each time.

90

So we now do report an error when parsing the option in the operator= below, which means that this else case should really be unreachable now.

3510

Hi @dmgreen, I've tried to address your concerns about using tail-folding for very tight loops here. It's a little crude, but I've introduced an instruction threshold, below which we don't tail-fold. For the godbolt example you gave we won't use tail-folding, but for the SPEC2017 benchmarks (and many others) we will.
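
A minimal sketch of the kind of threshold check described (the option name
is the one the patch eventually settled on; the helper and its exact
placement are hypothetical):

#include "llvm/Analysis/LoopInfo.h"
#include "llvm/Support/CommandLine.h"
using namespace llvm;

static cl::opt<unsigned> SVETailFoldInsnThreshold(
    "sve-tail-folding-insn-threshold", cl::init(10), cl::Hidden,
    cl::desc("Do not tail-fold loops with fewer instructions than this"));

static bool worthTailFolding(const Loop *L) {
  unsigned NumInsns = 0;
  for (BasicBlock *BB : L->getBlocks())
    NumInsns += BB->sizeWithoutDebug();
  // Tight loops below the threshold keep the unpredicated, interleavable form.
  return NumInsns >= SVETailFoldInsnThreshold;
}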

Thanks. This still makes me a bit nervous, considering what we know about predication and the performance results I've seen.

Can you explain more about why the limit is 10 instructions? As far as I could see the limit on interleaving in the vectorizer is a cost of 20, and with many instructions like geps, phis and branches being free that will be quite a bit more than 10 instructions. We could have the limit lower than the default for interleaving if that makes more sense, but 10 seems quite low.

david-arm marked an inline comment as done.May 3 2023, 8:51 AM

I guess we have to choose a limit somewhere. Whatever number we pick there will always be an example that proves it's bad. The goal here is not to make it the fastest for every single case, which is not really possible as shown by holes we sometimes find in our cost model. We want to make it good overall for the majority of cases. This patch and the number chosen here are not fixed - these are things that we can evolve over time based on real evidence.

I specifically chose 10 as a starting point because that's based on the example you gave me:

void test(float *x, float *y, int n) {
  for (int i = 0; i < n; i++)
    x[i] = -y[i];
}

which has 9 LLVM IR instructions. Ignoring tail-folding completely, if you build this with current HEAD of LLVM you'll notice that interleaving is slightly faster than not interleaving. However, when you change this to:

void test(float *x, float *y, int n) {
  for (int i = 0; i < n; i++)
    x[i] -= y[i];
}

suddenly the non-interleaved version (-mllvm -force-vector-interleave=1) becomes 18% faster than the interleaved version on neoverse-v1. There is only one extra instruction in the loop, which makes that 10. This means that today we already have an upstream performance bug caused by a poor interleaving choice. Just by chance more than anything else, this loop becomes faster when applying this patch.

Using your suggestion of 20 proved detrimental for x264 performance on neoverse-v1, with ~5% performance regression.

Those numbers do not match what I see. The second version is 25% slower without interleaving in the example I tried. It can be dependent on the trip count though, or there could be something else going on. I will try to contact you to see where the discrepancy might lie.

david-arm updated this revision to Diff 521631.May 12 2023, 6:38 AM
david-arm edited the summary of this revision. (Show Details)
  • Bumped up the instruction threshold from 10 -> 15 in order to reduce the risk of causing possible regressions for tight loops. This still leaves us with a 7% win for x264, but there is also no longer a regression for parest because the loops are now too small to be tail-folded. This is more by chance however, since the problem with tail-folding for parest seems unrelated to loop size and is instead due to problems with code quality.
paulwalker-arm accepted this revision.May 12 2023, 7:18 AM

I'm happy with this patch. It's perhaps not 100% clear cut, but on average we're seeing more (important?) wins than losses, and there's nothing binding here. We're in the middle of the LLVM 17 development cycle, and thus if evidence presents a more valuable configuration then fine, let's use that. Making the change also gives us more eyes on the testing and benchmarking of tail folding, which I see as a good thing.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
42

Please can this be tail-folding, to match the existing command-line option naming.

This revision is now accepted and ready to land.May 12 2023, 7:18 AM

Thanks, a threshold of 15 will certainly help mitigate some of the issues in common cases. The results I have are still not looking great in places, but like we discussed some of this will be dependent on alignment and other issues, like multiple IR instructions becoming a single AArch64 instruction. And there are certainly cases where this is making improvements, even if it still makes me nervous.

On a more technical note I don't think this works without setting -sve-tail-folding=default as nothing will set NeedsDefault.

david-arm updated this revision to Diff 522212.May 15 2023, 8:40 AM
  • Rename instruction threshold option to "sve-tail-folding-insn-threshold"
david-arm marked an inline comment as done.May 15 2023, 8:42 AM

Thanks for pointing this out @dmgreen! I noticed this too - I've fixed the parent NFC patch so that we always set NeedsDefault=true in the absence of any explicit user-specified option. I added an extra RUN line in sve-tail-folding-option.ll to test this too.

This revision was landed with ongoing or failed builds.May 18 2023, 3:36 AM
This revision was automatically updated to reflect the committed changes.
dewen added a subscriber: dewen.Jul 9 2023, 7:18 PM