This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
llvm/
-
lib/
-
Target/AArch64/
-
AArch64/
-
AArch64Subtarget.cpp
16/18
AArch64TargetTransformInfo.cpp
-
Utils/
1
AArch64BaseInfo.h
-
Transforms/Vectorize/
-
Vectorize/
-
LoopVectorize.cpp
-
test/Transforms/LoopVectorize/AArch64/
-
Transforms/
-
LoopVectorize/
-
AArch64/
-
sve-tail-folding-option.ll

Differential D130618

[AArch64][LoopVectorize] Enable tail-folding of simple loops on neoverse-v1
ClosedPublic

Authored by david-arm on Jul 27 2022, 2:28 AM.

Download Raw Diff

Details

Reviewers

sdesmalen
paulwalker-arm
kmclaughlin
dmgreen

Commits

rGc7dbe326dff8: [AArch64][LoopVectorize] Enable tail-folding of simple loops on neoverse-v1

Summary

This patch enables the tail-folding of simple loops by default
when targeting the neoverse-v1 CPU. Simple loops exclude those
with recurrences or reductions or loops that are reversed.

New tests have been added here:

Transforms/LoopVectorize/AArch64/sve-tail-folding-option.ll

In terms of SPEC2017 only one benchmark is really affected when
building with "-Ofast -mcpu=neoverse-v1 -flto", which is
(+ faster, - slower):

525.x264: +7.0%

Diff Detail

Unit TestsFailed

	Time	Test
	2,670 ms	x64 debian > SanitizerCommon-asan-x86_64-Linux.SanitizerCommon-asan-x86_64-Linux::sanitizer_coverage_allowlist_ignorelist.cpp
	2,310 ms	x64 debian > SanitizerCommon-lsan-x86_64-Linux.SanitizerCommon-lsan-x86_64-Linux::sanitizer_coverage_allowlist_ignorelist.cpp
	2,490 ms	x64 debian > SanitizerCommon-msan-x86_64-Linux.SanitizerCommon-msan-x86_64-Linux::sanitizer_coverage_allowlist_ignorelist.cpp
	60,030 ms	x64 debian > libFuzzer.libFuzzer::fuzzer-leak.test
	60,040 ms	x64 debian > libFuzzer.libFuzzer::value-profile-load.test

Event Timeline

david-arm created this revision.Jul 27 2022, 2:28 AM

Herald added subscribers: shiva0217, ctetreau, CarolineConcatto and 3 others. · View Herald TranscriptJul 27 2022, 2:28 AM

Herald added a project: Restricted Project. · View Herald TranscriptJul 27 2022, 2:28 AM

david-arm requested review of this revision.Jul 27 2022, 2:28 AM

Herald added a project: Restricted Project. · View Herald TranscriptJul 27 2022, 2:28 AM

Herald added a subscriber: llvm-commits. · View Herald Transcript

Harbormaster completed remote builds in B177811: Diff 447963.Jul 27 2022, 2:29 AM

david-arm added a parent revision: D128342: [AArch64][LoopVectorize] Disable tail-folding for SVE when loop has interleaved accesses.Jul 27 2022, 4:29 AM

Is the restriction the vector length in disguise or is something really special, i.e., the N2 has short SVE vectors?

Matt added a subscriber: Matt.Jul 27 2022, 1:34 PM

In D130618#3682600, @tschuett wrote:

Is the restriction the vector length in disguise or is something really special, i.e., the N2 has short SVE vectors?

By 'restriction' are you referring to how this patch limits the types of loop we consider for tail-folding on the N1? This decision is based purely on the observed results from running benchmarks on hardware.

sdesmalen added inline comments.Jul 28 2022, 8:02 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
71–72	This class seems to be mixing two concepts: Features that the loop requires for tail folding (i.e. recurrences/reductions/anything-else). What the target wants as the default and the parsing-logic for user-visible toggles. This makes it quite tricky to follow the logic, especially the need for both `NeedsDefault` + `DefaultBits` and Add/RemoveBits`. Perhaps it would make more sense to have two classes: class TailFoldingKind; // This encodes which features the loop requires class TailFoldingOption; // This encodes the default (which itself will be a TailFoldingKind object), and has logic to parse strings. Then you can add a method such as `TailFoldingKind TailFoldingOption::getSupportedMode() const { .. }` that you can use to query if the target's mode satisfies the required TailFoldingKind from the Loop.
112–116	Perhaps I'm missing something, but this logic with `AddBits` and `RemoveBits` seems unnecessarily complicated. Can't you have a single bitfield and do something like this: void set(uint8_t Flags, bool Enable) { if (Enable) Bits \|= Flags; else Bits &= ~Flags; } ?
3077–3078	This seems redundant given the line above that says 'Defaults to 0'. How about creating a constructor for it that takes a TailFoldingOpts as operand, so that you can write `TailFoldingKind Required(TFDisabled)`?
3085	Is it worth creating a method for this, so that you don't have to expose the bit-field to users, e.g. return TailFoldingKindLoc.satisfies(Required); ?

david-arm added inline comments.Jul 28 2022, 8:42 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
112–116	The reason for this is because at the time of parsing the string we have no idea what the default options are going to be. I basically wanted to avoid creating a dependency on the CPU such that the user has to put the -sve-tail-folding option after the -mcpu flag. In the tests I added two RUN lines for both "-mcpu=neoverse-v1 -sve-tail-folding=default" and "-sve-tail-folding=default -mcpu=neoverse-v1". In the latter case we can't build on top of the default bits because we don't yet have them at the time of parsing. An example I'm thinking of is this: -sve-tail-folding=default+nosimple -mcpu=neoverse-v1 which is a bit daft (and we don't even have a nosimple option yet!). We only know the default (simple only for neoverse-v1) once we've parsed the -mcpu flag and therefore we can't remove the simple flag until later. So I tried doing this by keeping a record of the bits we want to add and remove, and apply them later. That's why a single 'Bits' field like you mentioned above doesn't work. I can have a look at your suggestion above and see if that solves the problem.

paulwalker-arm added inline comments.Jul 29 2022, 10:18 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
71–72	This doesn't quite work because you've lost the position of the default value compared to the explicitly enabled/disabled options. It's almost as if you want to defer parsing until when the data is required and then have something like `TailFoldingKindLoc.parseWithDefault(getTailFoldingDefaultForCPU())` . I think if you can do this, many of the other changes in the patch become necessary.

Renamed TailFoldingKind -> TailFoldingOption and introduced a simple TailFoldingKind class.
TailFoldingOption now stashes a copy of the option string and parses it on demand to ensure that bits are added and removed in the correct order.
Added a new satisfies(TailFoldingKind Required) interface to TailFoldingOption and simplified the logic in getScalarEpilogueLowering

david-arm marked 4 inline comments as done.Aug 2 2022, 6:15 AM

david-arm added inline comments.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
71–72	I've tried to separate out the two concepts into two different classes, but I've kept the DefaultBits as part of the new TailFoldingOption class.

Harbormaster completed remote builds in B178757: Diff 449273.Aug 2 2022, 7:23 AM

Let cortex-a710 and x2 have the same SVE tail-folding defaults.

Harbormaster completed remote builds in B178773: Diff 449298.Aug 2 2022, 9:04 AM

Can you explain more about why reductions are a problem for certain cpus? What about the cortex-a510? And if it being disabled for all these cpus, should it be disabled for -mcpu=generic too? I'm not sure why we would disable sve reductions though - first-order-recurrences make more sense, but that might be something that is better is done in general, not per-subtarget.

And is tail-folding expected to be beneficial in general? As far as I can see it might currently be losing the interleaving, which can be important for performance. And it should ideally not be altering the NEON codegen, if that could be preferable. Is this currently one option that alters both scalable and fixed width vectorization?

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
122	Is this setting a global variable? I think it should be just a field in the subtarget (maybe a subtarget feature), that is potentially overridden by the option if it is present.

In D130618#3696283, @dmgreen wrote:

Can you explain more about why reductions are a problem for certain cpus? What about the cortex-a510? And if it being disabled for all these cpus, should it be disabled for -mcpu=generic too? I'm not sure why we would disable sve reductions though - first-order-recurrences make more sense, but that might be something that is better is done in general, not per-subtarget.

And is tail-folding expected to be beneficial in general? As far as I can see it might currently be losing the interleaving, which can be important for performance. And it should ideally not be altering the NEON codegen, if that could be preferable. Is this currently one option that alters both scalable and fixed width vectorization?

Hi @dmgreen, through benchmarking and performance analysis we discovered that on some cores (neoverse-v1, x2, a64fx) if you use tail-folding with reductions the performance is significantly worse (i.e. 80-100% slower on some loops!) than using unpredicated vector loops. This was observed consistently across a range of loops, benchmarks and CPUs, although we don't know exactly why. Our best guess is that it's to do with the chain of loop-carried dependencies in the loops, i.e. reduction PHI + scalar IV + loop predicate PHI. So it's absolutely critical that we avoid signficant regressions for benchmarks that contain reductions or first-order recurrences and this patch is a sort of a compromise. If you don't specify the CPU then we will follow architectural intent and always tail-fold for all loops, but when targeting CPUs with this issue we disable tail-folding for such loops.

In general, tail-folding is beneficial for reducing code size and mopping up the scalar tail, as well following the intentions of the architecture. For example, x264 in SPEC2k17 sees 6-7% performance improvements on neoverse-v1 CPUs due to the low trip counts in hot loops.

With regards to interleaving, the fundamental problem lies with how we do tail-folding in the loop vectoriser, which forces us to make cost-based decisions about whether to use tail-folding or not before we've calculated any loop costs. Enabling tail-folding has consequences because suddenly your loops become predicated and the costs change accordingly. For example, NEON does not support masked interleaved memory accesses, so enabling tail-folding leads to insane fixed-width VF costs. At the same time the loop vectoriser does not support vectorising interleaved memory accesses for scalable vectors either, so we end up in a situation where the vectoriser decides not to vectorise at all! Whereas if we don't enable tail-folding we will vectorise using a fixed-width VF and use NEON's ld2/st2/etc instructions, which is often still faster than a scalar loop. Ultimately in the long term we would like to change the loop vectoriser to consider a matrix of costs, with vectorisation style on one axis and VF on the other, then choose the most optimal cost in that matrix. But this is a non-trivial piece of work, so in the short term we opted for this temporary solution.

Hi @dmgreen, through benchmarking and performance analysis we discovered that on some cores (neoverse-v1, x2, a64fx) if you use tail-folding with reductions the performance is significantly worse (i.e. 80-100% slower on some loops!) than using unpredicated vector loops. This was observed consistently across a range of loops, benchmarks and CPUs, although we don't know exactly why. Our best guess is that it's to do with the chain of loop-carried dependencies in the loops, i.e. reduction PHI + scalar IV + loop predicate PHI. So it's absolutely critical that we avoid signficant regressions for benchmarks that contain reductions or first-order recurrences and this patch is a sort of a compromise. If you don't specify the CPU then we will follow architectural intent and always tail-fold for all loops, but when targeting CPUs with this issue we disable tail-folding for such loops.

In general, tail-folding is beneficial for reducing code size and mopping up the scalar tail, as well following the intentions of the architecture. For example, x264 in SPEC2k17 sees 6-7% performance improvements on neoverse-v1 CPUs due to the low trip counts in hot loops.

With regards to interleaving, the fundamental problem lies with how we do tail-folding in the loop vectoriser, which forces us to make cost-based decisions about whether to use tail-folding or not before we've calculated any loop costs. Enabling tail-folding has consequences because suddenly your loops become predicated and the costs change accordingly. For example, NEON does not support masked interleaved memory accesses, so enabling tail-folding leads to insane fixed-width VF costs. At the same time the loop vectoriser does not support vectorising interleaved memory accesses for scalable vectors either, so we end up in a situation where the vectoriser decides not to vectorise at all! Whereas if we don't enable tail-folding we will vectorise using a fixed-width VF and use NEON's ld2/st2/etc instructions, which is often still faster than a scalar loop. Ultimately in the long term we would like to change the loop vectoriser to consider a matrix of costs, with vectorisation style on one axis and VF on the other, then choose the most optimal cost in that matrix. But this is a non-trivial piece of work, so in the short term we opted for this temporary solution.

Hello. That sounds like the loops need unrolling - that's common in order to get the best ILP out of cores. Especially for in-order cores, but it is important for out of order cores too. Something like the Neoverse-N2 has 2 FP/ASIMD pipes, and in order to keep them busy you need to unroll small loops. The Neoverse-V1 has 4 FP/ASIMD pipes. But that's not limited to reductions, it applies to any small loop as far as I understand. This is the reason that we use a MaxInterleaveFactor of at least 2 for all cores in llvm, and some set it higher. I don't think that changes with SVE. It still needs to unroll the loop body, preferably with a predicated epilogue.

It is true that sometimes this extra unrolling is unhelpful for loops with small trip counts. But I was hoping that the predicated epilogue would help with that. I thought that was the plan? What happened to making an unrolled/unpredicated body with a predicated remainder? There will be a loops that are large enough that we do not unroll. Those probably make sense to predicate, so long as the performance is good (it will depend on the core). For Os too. For MVE it certainly made sense once we had worked through fixing all the performance degradations we could find, but those are very different cores which pay more heavily for code-structure inefficiencies. And they (almost) never need the unrolling.

For AArch64 the interleaving count will be based on the cost of the loop, among other things. Maybe this should be based on the cost of the loop too? Does tail folding really mean we cannot unroll?

Oh - and about -mcpu=generic - it'd difficult but it needs to be the best we can do for the most cores possible. Especially on Android systems where the chances of using a correct -mcpu are almost none. What that is I'm not entirely sure. I think maybe unpredicated loop with predicated remainder, with the option to change to something else in the future?

Hi @dmgreen, we had exactly the same thought process as you. We have already explored unrolling tail-folded loops with reductions and even with the best code quality it makes zero difference to performance on these cores. Sadly it doesn't make the loops faster in the slightest - there appears to be a fundamental bottleneck that cannot be surpassed. No amount of unrolling (using LLVM or by hand) helps in the case of reductions or first-order recurrences. I have also tried unrolling plus manually rescheduling instructions in different ways, but to no avail. We were also very suprised by this, and wish it were different!

In D130618#3699606, @david-arm wrote:

Hi @dmgreen, we had exactly the same thought process as you. We have already explored unrolling tail-folded loops with reductions and even with the best code quality it makes zero difference to performance on these cores. Sadly it doesn't make the loops faster in the slightest - there appears to be a fundamental bottleneck that cannot be surpassed. No amount of unrolling (using LLVM or by hand) helps in the case of reductions or first-order recurrences. I have also tried unrolling plus manually rescheduling instructions in different ways, but to no avail. We were also very suprised by this, and wish it were different!

It sounds like you are hitting another bottleneck. Perhaps the amount of predication resources.

david-arm mentioned this in D138421: [AArch64][SVE] Enable Tail-Folding. WIP.Nov 21 2022, 6:12 AM

Hi @SjoerdMeijer, thanks for looking into this. I do actually already have a patch to enable this by default (https://reviews.llvm.org/D130618), where the default behaviour is tuned according to the CPU. I think this is what we want because the profile will change according to what CPU you're running on - some CPUs may handle reductions better than others.

I didn't know what the problem is with reductions. I now see you had a discussion with Dave here about reductions, but it looks it is still unclear why exactly that is a problem (at least to me).
But I thought that "simple" would be a good start as a default, which is what you're more or less doing here too. Well, you're setting it for a few cpus, but my feeling is that "simple" should be a very safe bet for all.

The decision in this patch may be incorrect for 128-bit vector implementations. I also ran SPEC2k17 on a SVE-enabled CPU as well and I remember I saw a small (2-3%) regression in parest or something like that, which is one of the reasons I didn't push the patch any further. I also think it's really important to run a much larger set of benchmarks besides SPEC2k17 and collect numbers to show the benefits, since there isn't much vectorisation actually going on in SPEC2k17.

I will collect data for SPEC FP. I don't know exactly how much vectorisation is going on there, but will see. By the way, most vectorisation happens in x264 which sees a significant uplift. My numbers on a 256 bit vector implementation sees neutral results for the other SPEC INT apps. I think this is a very good start.

One of the major problems with the currrent tail-folding implementation is that we make the decision before doing any cost analysis in the vectoriser, which isn't great because we may be forcing the vectoriser to take different code paths to if we didn't tail-fold. Ideally what we really want is to move to a model where the vectoriser has a two-dimensional matrix of costs considering the combination of VF and vectorisation style (e.g. tail-folding vs whole vector loops, etc.), and choose the most optimal combination.

Yeah, I am aware and noticed this while implementing tail-folding for MVE. This is a problem, but something we have to live with for a while I think as it is not very low hanging fruit to change this. That's my impression at least. However, with the "simple" tail-folding it is difficult to see how it would lead to regressions.

What are your ideas to progress this?

Herald added a subscriber: • pcwang-thead. · View Herald TranscriptNov 21 2022, 7:36 AM

I saw a small (2-3%) regression in parest or something like that

Yep, I see a 2.7% regression in parest.

iamlouk added a subscriber: iamlouk.Feb 28 2023, 1:13 AM

Rebase.
The patch now only enables tail-folding by default for neoverse-v1.

Harbormaster completed remote builds in B227736: Diff 516418.Apr 24 2023, 9:02 AM

Hi @david-arm, I wonder if it makes sense to move some of these changes to a separate (NFC?) patch where you do the refactoring of the class, so that this patch becomes as simple as calling setSVETailFoldingDefaultOpts(TFSimple);. That way, if for any reason the patch needs to be reverted, we don't loose out on the refactoring. We can still test the current code with the llc options.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
41	Does this class still have any value? It only does bitwise \|, & and ~, which you can inline below.
58	nit: Could you rename this to UnparsedOptionString or something?
58–59	Rather than storing the string and substrings, can you just do the parsing as part of the constructor of TailFoldingOption so that you only need to store the bits?
84	Instead of an llvm_unreachable, do you want to print the error message here?
104	This is only printing to stderr. Should this be a fatal error?
llvm/lib/Target/AArch64/Utils/AArch64BaseInfo.h
519	I know you're just moving this code around, but could you add a description of what 'simple' means? And maybe also add a description for TFRecurrences (are those first-order recurrences that reqequire a splice operation?) (very minor nit: would it make sense to make this enum value `0x01` so that it's clear that anything not '1' is no longer a 'simple' loop?)

david-arm added inline comments.Apr 25 2023, 1:28 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
58–59	I seem to remember there was a problem with ordering where we had to ensure we got the same behaviour for each version of this command line: `-mcpu=native -mllvm -sve-tail-folding=default` and `-mllvm -sve-tail-folding=default -mcpu=native` which is tested in sve-tail-folding-option.ll. I think it meant that we could only deal with the options after the entire command line has been parsed, so we couldn't do it in the constructor. I'll double check though as I may have remembered incorrectly!

Do you intend this to be a real patch to review? It doesn't match my understanding of the theory, but perhaps something has changed?

In D130618#4294921, @dmgreen wrote:

Do you intend this to be a real patch to review? It doesn't match my understanding of the theory, but perhaps something has changed?

The short answer is - yes. As the patch says we want to enable tail-folding by default for simple loops on neoverse-v1 CPUs.

OK. My understanding was that the Neoverse-V1 was one of the worst cpus for tail folding due to the limited throughput of while instructions. Doesn't tail folding prevent interleaving too? Newer generations should do better than the V1.

This is one of the "simplest" loops I can think of. My understanding is that it will go around half the speed with this patch: https://godbolt.org/z/K4e3PMdar

Are you sure this shouldn't be based on whether the loop is small enough to desire interleaving? As in the difference between Simple and Reductions was never really about reductions, those were just the loops that hit the problem the hardest. (There may be some other issues with reductions codegen, but the limited interleaving/throughput still remains). The real problem is that we need to make sure we don't limit the interleaving for small loops, or that the while instructions become a bottleneck for throughput.

Allen added a subscriber: Allen.Apr 27 2023, 12:37 AM

Addressed review comments, split off some of the refactoring into a NFC patch (D149659).
Restricted tail-folding only to loops that exceed a certain instruction threshold, so that we get the benefit of interleaving for tight loops.

david-arm added a parent revision: D149659: [AArch64][NFC] Refactor the tail-folding option.May 2 2023, 8:43 AM

david-arm marked 5 inline comments as done.May 2 2023, 8:54 AM

david-arm added inline comments.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
58–59	Hi @sdesmalen, so this isn't possible because (as mentioned above) you can't guarantee that the construction (in this case the `operator=`) of the option happens before setting the CPU with `-mcpu=neoverse-v1`. So you have to construct the bits only after you're guaranteed to have parsed all options on the command line. This is why we have to dynamically reconstruct the bits each time.
84	So we now do report an error when parsing the option with the `operator=` flag below, which means that this else case should really be unreachable now.
3084–3085	Hi @dmgreen, I've tried to address your concerns about using tail-folding for very tight loops here. It's a little crude, but I've introduced an instruction threshold, below which we don't tail-fold. For the godbolt example you gave we won't use tail-folding, but for the SPEC2017 benchmarks (and many others) we will.

Harbormaster completed remote builds in B229454: Diff 518757.May 2 2023, 9:42 AM

Thanks. This still makes me a bit nervous, considering what we know about predication and the performance results I've seen.

Can you explain more about why the limit is 10 instructions? As far as I could see the limit on interleaving in the vectorizer is a cost of 20, and with many instructions like geps, phis and branches being free that will be quite a bit more than 10 instructions. We could have the limit lower than the default for interleaving if that makes more sense, but 10 seems quite low.

In D130618#4315481, @dmgreen wrote:

Thanks. This still makes me a bit nervous, considering what we know about predication and the performance results I've seen.

Can you explain more about why the limit is 10 instructions? As far as I could see the limit on interleaving in the vectorizer is a cost of 20, and with many instructions like geps, phis and branches being free that will be quite a bit more than 10 instructions. We could have the limit lower than the default for interleaving if that makes more sense, but 10 seems quite low.

I guess we have to choose a limit somewhere. Whatever number we pick there will always be an example that proves it's bad. The goal here is not to make it the fastest for every single case, which is not really possible as shown by holes we sometimes find in our cost model. We want to make it good overall for the majority of cases. This patch and the number chosen here are not fixed - these are things that we can evolve over time based on real evidence.

I specifically chose 10 as a starting point because that's based on the example you gave me:

void test(float *x, float *y, int n) {
  for (int i = 0; i < n; i++)
    x[i] = -y[i];
}

which has 9 LLVM IR instructions. Ignoring tail-folding completely, if you build this with current HEAD of LLVM you'll notice that interleaving is slightly faster than not interleaving. However, when you change this to:

void test(float *x, float *y, int n) {
  for (int i = 0; i < n; i++)
    x[i] -= y[i];
}

suddenly the non-interleaved version (-mllvm -force-vector-interleave=1) becomes 18% faster than the interleaved version on neoverse-v1. There is only one extra instruction in the loop, which makes that 10. This means that today we already have an upstream performance bug caused by a poor interleaving choice. Just by chance more than anything else, this loop becomes faster when applying this patch.

Using your suggestion of 20 proved detrimental for x264 performance on neoverse-v1, with ~5% performance regression.

Those number do not match what I see. The second version is 25% slower without interleaving in the example I tried. It can be dependant on the trip count though, or there could be something else going on. I will try to contact you to see where the discrepancy might lie.

Rebase.

Harbormaster completed remote builds in B231080: Diff 520968.May 10 2023, 7:05 AM

Bumped up the instruction threshold from 10 -> 15 in order to reduce the risk of causing possible regressions for tight loops. This still leaves us with a 7% win for x264, but there is also no longer a regression for parest because the loops are now too small to be tail-folded. This is more by chance however, since the problem with tail-folding for parest seems unrelated to loop size and due to problems with code quality.

I'm happy with this patch. It's perhaps not 100% clear cut but on average we're seeing more (important?) wins than losses and there's nothing binding here. We're in the middle of the LLVM 17 development cycle and thus if evidence presents a more valuable configuration then fine let's use that. Making the change also gives us more eyes on the testing and benchmarking of tail folding, which I see as a good thing.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
41	Please can this be `tail-folding` to match the existing command line option naming.

This revision is now accepted and ready to land.May 12 2023, 7:18 AM

Harbormaster completed remote builds in B231586: Diff 521631.May 12 2023, 7:22 AM

Thanks, a threshold of 15 will certainly help mitigate some of the issues in common cases. The results I have are still not looking great in places, but like we discussed some of this will be dependant on alignment and other issues like multiple ir instructions becoming a single aarch64 instruction. And there are certainly cases where this is making improvements, even if it still makes me nervous.

On a more technical note I don't think this works without setting -sve-tail-folding=default as nothing will set NeedsDefault.

Rebase.

Rename instruction threshold option to "sve-tail-folding-insn-threshold"

In D130618#4341270, @dmgreen wrote:

On a more technical note I don't think this works without setting -sve-tail-folding=default as nothing will set NeedsDefault.

Thanks for pointing this out @dmgreen! I noticed this too - I've fixed the parent NFC patch so that we always set NeedsDefault=true in the absence of any explicit user-specified option. I added an extra RUN line in sve-tail-folding-option.ll to test this too.

Harbormaster completed remote builds in B232025: Diff 522212.May 15 2023, 10:20 AM

Thanks

Rebase

This revision was landed with ongoing or failed builds.May 18 2023, 3:36 AM

Closed by commit rGc7dbe326dff8: [AArch64][LoopVectorize] Enable tail-folding of simple loops on neoverse-v1 (authored by david-arm). · Explain Why

This revision was automatically updated to reflect the committed changes.

david-arm added a commit: rGc7dbe326dff8: [AArch64][LoopVectorize] Enable tail-folding of simple loops on neoverse-v1.

dewen added a subscriber: dewen.Jul 9 2023, 7:18 PM

Herald added subscribers: wangpc, artagnon. · View Herald TranscriptJul 9 2023, 7:18 PM

Revision Contents

Path

Size

llvm/

lib/

Target/

AArch64/

AArch64Subtarget.cpp

3 lines

AArch64TargetTransformInfo.cpp

102 lines

Utils/

AArch64BaseInfo.h

10 lines

Transforms/

Vectorize/

LoopVectorize.cpp

5 lines

test/

Transforms/

LoopVectorize/

AArch64/

sve-tail-folding-option.ll

46 lines

Diff 449298

llvm/lib/Target/AArch64/AArch64Subtarget.cpp

Show First 20 Lines • Show All 128 Lines • ▼ Show 20 Lines	case CortexA510:
MaxBytesForLoopAlignment = 8;		MaxBytesForLoopAlignment = 8;
break;		break;
case CortexA710:		case CortexA710:
case CortexX2:		case CortexX2:
PrefFunctionLogAlignment = 4;		PrefFunctionLogAlignment = 4;
VScaleForTuning = 1;		VScaleForTuning = 1;
PrefLoopLogAlignment = 5;		PrefLoopLogAlignment = 5;
MaxBytesForLoopAlignment = 16;		MaxBytesForLoopAlignment = 16;
		setSVETailFoldingDefaultOpts(TFSimple);
break;		break;
case A64FX:		case A64FX:
CacheLineSize = 256;		CacheLineSize = 256;
PrefFunctionLogAlignment = 3;		PrefFunctionLogAlignment = 3;
PrefLoopLogAlignment = 2;		PrefLoopLogAlignment = 2;
MaxInterleaveFactor = 4;		MaxInterleaveFactor = 4;
PrefetchDistance = 128;		PrefetchDistance = 128;
MinPrefetchStride = 1024;		MinPrefetchStride = 1024;
MaxPrefetchIterationsAhead = 4;		MaxPrefetchIterationsAhead = 4;
VScaleForTuning = 4;		VScaleForTuning = 4;
		setSVETailFoldingDefaultOpts(TFSimple);
break;		break;
case AppleA7:		case AppleA7:
case AppleA10:		case AppleA10:
case AppleA11:		case AppleA11:
case AppleA12:		case AppleA12:
case AppleA13:		case AppleA13:
case AppleA14:		case AppleA14:
CacheLineSize = 64;		CacheLineSize = 64;
Show All 40 Lines	case NeoverseN2:
MaxBytesForLoopAlignment = 16;		MaxBytesForLoopAlignment = 16;
VScaleForTuning = 1;		VScaleForTuning = 1;
break;		break;
case NeoverseV1:		case NeoverseV1:
PrefFunctionLogAlignment = 4;		PrefFunctionLogAlignment = 4;
PrefLoopLogAlignment = 5;		PrefLoopLogAlignment = 5;
MaxBytesForLoopAlignment = 16;		MaxBytesForLoopAlignment = 16;
VScaleForTuning = 2;		VScaleForTuning = 2;
		setSVETailFoldingDefaultOpts(TFSimple);
break;		break;
case Neoverse512TVB:		case Neoverse512TVB:
PrefFunctionLogAlignment = 4;		PrefFunctionLogAlignment = 4;
VScaleForTuning = 1;		VScaleForTuning = 1;
MaxInterleaveFactor = 4;		MaxInterleaveFactor = 4;
break;		break;
case Saphira:		case Saphira:
MaxInterleaveFactor = 4;		MaxInterleaveFactor = 4;
▲ Show 20 Lines • Show All 202 Lines • Show Last 20 Lines

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

Show All 32 Lines	static cl::opt<bool> EnableFalkorHWPFUnrollFix("enable-falkor-hwpf-unroll-fix",
cl::init(true), cl::Hidden);		cl::init(true), cl::Hidden);

static cl::opt<unsigned> SVEGatherOverhead("sve-gather-overhead", cl::init(10),		static cl::opt<unsigned> SVEGatherOverhead("sve-gather-overhead", cl::init(10),
cl::Hidden);		cl::Hidden);

static cl::opt<unsigned> SVEScatterOverhead("sve-scatter-overhead",		static cl::opt<unsigned> SVEScatterOverhead("sve-scatter-overhead",
cl::init(10), cl::Hidden);		cl::init(10), cl::Hidden);

class TailFoldingKind {		class TailFoldingKind {
		sdesmalenUnsubmitted Done Reply Inline Actions Does this class still have any value? It only does bitwise \|, & and ~, which you can inline below. sdesmalen: Does this class still have any value? It only does bitwise \|, & and ~, which you can inline…
		paulwalker-armUnsubmitted Done Reply Inline Actions Please can this be `tail-folding` to match the existing command line option naming. paulwalker-arm: Please can this be `tail-folding` to match the existing command line option naming.
private:		private:
uint8_t Bits = 0; // Currently defaults to disabled.		uint8_t Bits;

public:		public:
enum TailFoldingOpts {		TailFoldingKind(uint8_t Bits) : Bits(Bits) {}
TFDisabled = 0x0,
TFReductions = 0x01,		void add(uint8_t Bit) { Bits \|= Bit; }
TFRecurrences = 0x02,
TFSimple = 0x80,		void remove(uint8_t Bit) { Bits &= ~Bit; }
TFAll = TFReductions \| TFRecurrences \| TFSimple
		operator uint8_t() const { return Bits; }
};		};

void operator=(const std::string &Val) {		class TailFoldingOption {
if (Val.empty())		private:
return;		uint8_t DefaultBits = TFAll;
SmallVector<StringRef, 6> TailFoldTypes;		std::string OrigVal;
		sdesmalenUnsubmitted Done Reply Inline Actions nit: Could you rename this to UnparsedOptionString or something? sdesmalen: nit: Could you rename this to UnparsedOptionString or something?
StringRef(Val).split(TailFoldTypes, '+', -1, false);		SmallVector<StringRef, 4> TailFoldTypes;
		sdesmalenUnsubmitted Done Reply Inline Actions Rather than storing the string and substrings, can you just do the parsing as part of the constructor of TailFoldingOption so that you only need to store the bits? sdesmalen: Rather than storing the string and substrings, can you just do the parsing as part of the…
		david-armAuthorUnsubmitted Done Reply Inline Actions I seem to remember there was a problem with ordering where we had to ensure we got the same behaviour for each version of this command line: `-mcpu=native -mllvm -sve-tail-folding=default` and `-mllvm -sve-tail-folding=default -mcpu=native` which is tested in sve-tail-folding-option.ll. I think it meant that we could only deal with the options after the entire command line has been parsed, so we couldn't do it in the constructor. I'll double check though as I may have remembered incorrectly! david-arm: I seem to remember there was a problem with ordering where we had to ensure we got the same…
		david-armAuthorUnsubmitted Done Reply Inline Actions Hi @sdesmalen, so this isn't possible because (as mentioned above) you can't guarantee that the construction (in this case the `operator=`) of the option happens before setting the CPU with `-mcpu=neoverse-v1`. So you have to construct the bits only after you're guaranteed to have parsed all options on the command line. This is why we have to dynamically reconstruct the bits each time. david-arm: Hi @sdesmalen, so this isn't possible because (as mentioned above) you can't guarantee that the…

		uint8_t getBits() const {
		if (!TailFoldTypes.size())
		return DefaultBits;

		TailFoldingKind Bits(0);
for (auto TailFoldType : TailFoldTypes) {		for (auto TailFoldType : TailFoldTypes) {
if (TailFoldType == "disabled")		if (TailFoldType == "disabled")
Bits = 0;		Bits.remove(TFAll);
else if (TailFoldType == "all")		else if (TailFoldType == "all")
Bits = TFAll;		Bits.add(TFAll);
else if (TailFoldType == "default")		else if (TailFoldType == "default")
Bits = 0; // Currently defaults to never tail-folding.		Bits.add(DefaultBits);
		sdesmalenUnsubmitted Done Reply Inline Actions This class seems to be mixing two concepts: Features that the loop requires for tail folding (i.e. recurrences/reductions/anything-else). What the target wants as the default and the parsing-logic for user-visible toggles. This makes it quite tricky to follow the logic, especially the need for both `NeedsDefault` + `DefaultBits` and Add/RemoveBits`. Perhaps it would make more sense to have two classes: class TailFoldingKind; // This encodes which features the loop requires class TailFoldingOption; // This encodes the default (which itself will be a TailFoldingKind object), and has logic to parse strings. Then you can add a method such as `TailFoldingKind TailFoldingOption::getSupportedMode() const { .. }` that you can use to query if the target's mode satisfies the required TailFoldingKind from the Loop. sdesmalen: This class seems to be mixing two concepts: * Features that the loop requires for tail folding…
		david-armAuthorUnsubmitted Done Reply Inline Actions I've tried to separate out the two concepts into two different classes, but I've kept the DefaultBits as part of the new TailFoldingOption class. david-arm: I've tried to separate out the two concepts into two different classes, but I've kept the…
		paulwalker-armUnsubmitted Not Done Reply Inline Actions This doesn't quite work because you've lost the position of the default value compared to the explicitly enabled/disabled options. It's almost as if you want to defer parsing until when the data is required and then have something like `TailFoldingKindLoc.parseWithDefault(getTailFoldingDefaultForCPU())` . I think if you can do this, many of the other changes in the patch become necessary. paulwalker-arm: This doesn't quite work because you've lost the position of the default value compared to the…
else if (TailFoldType == "simple")		else if (TailFoldType == "simple")
add(TFSimple);		Bits.add(TFSimple);
else if (TailFoldType == "reductions")		else if (TailFoldType == "reductions")
add(TFReductions);		Bits.add(TFReductions);
else if (TailFoldType == "recurrences")		else if (TailFoldType == "recurrences")
add(TFRecurrences);		Bits.add(TFRecurrences);
else if (TailFoldType == "noreductions")		else if (TailFoldType == "noreductions")
remove(TFReductions);		Bits.remove(TFReductions);
else if (TailFoldType == "norecurrences")		else if (TailFoldType == "norecurrences")
remove(TFRecurrences);		Bits.remove(TFRecurrences);
else {		else
		llvm_unreachable("No! That's impossible!");
		sdesmalenUnsubmitted Done Reply Inline Actions Instead of an llvm_unreachable, do you want to print the error message here? sdesmalen: Instead of an llvm_unreachable, do you want to print the error message here?
		david-armAuthorUnsubmitted Done Reply Inline Actions So we now do report an error when parsing the option with the `operator=` flag below, which means that this else case should really be unreachable now. david-arm: So we now do report an error when parsing the option with the `operator=` flag below, which…
		}

		return Bits;
		}

		public:
		void setDefault(uint8_t V) { DefaultBits = V; }

		void operator=(const std::string &Val) {
		if (Val.empty())
		return;

		OrigVal = Val;
		StringRef(OrigVal).split(TailFoldTypes, '+', -1, false);
		for (auto TailFoldType : TailFoldTypes) {
		if (TailFoldType != "disabled" && TailFoldType != "all" &&
		TailFoldType != "default" && TailFoldType != "simple" &&
		TailFoldType != "reductions" && TailFoldType != "recurrences" &&
		TailFoldType != "noreductions" && TailFoldType != "norecurrences") {
errs()		errs()
		sdesmalenUnsubmitted Not Done Reply Inline Actions This is only printing to stderr. Should this be a fatal error? sdesmalen: This is only printing to stderr. Should this be a fatal error?
<< "invalid argument " << TailFoldType.str()		<< "invalid argument " << TailFoldType.str()
<< " to -sve-tail-folding=; each element must be one of: disabled, "		<< " to -sve-tail-folding=; each element must be one of: disabled, "
"all, default, simple, reductions, noreductions, recurrences, "		"all, default, simple, reductions, noreductions, recurrences, "
"norecurrences\n";		"norecurrences\n";
}		}
}		}
}		}

operator uint8_t() const { return Bits; }		bool satisfies(TailFoldingKind Required) const {
		return (getBits() & Required) == Required;
void add(uint8_t Flag) { Bits \|= Flag; }		}
void remove(uint8_t Flag) { Bits &= ~Flag; }
};		};
		sdesmalenUnsubmitted Done Reply Inline Actions Perhaps I'm missing something, but this logic with `AddBits` and `RemoveBits` seems unnecessarily complicated. Can't you have a single bitfield and do something like this: void set(uint8_t Flags, bool Enable) { if (Enable) Bits \|= Flags; else Bits &= ~Flags; } ? sdesmalen: Perhaps I'm missing something, but this logic with `AddBits` and `RemoveBits` seems…
		david-armAuthorUnsubmitted Done Reply Inline Actions The reason for this is because at the time of parsing the string we have no idea what the default options are going to be. I basically wanted to avoid creating a dependency on the CPU such that the user has to put the -sve-tail-folding option after the -mcpu flag. In the tests I added two RUN lines for both "-mcpu=neoverse-v1 -sve-tail-folding=default" and "-sve-tail-folding=default -mcpu=neoverse-v1". In the latter case we can't build on top of the default bits because we don't yet have them at the time of parsing. An example I'm thinking of is this: -sve-tail-folding=default+nosimple -mcpu=neoverse-v1 which is a bit daft (and we don't even have a nosimple option yet!). We only know the default (simple only for neoverse-v1) once we've parsed the -mcpu flag and therefore we can't remove the simple flag until later. So I tried doing this by keeping a record of the bits we want to add and remove, and apply them later. That's why a single 'Bits' field like you mentioned above doesn't work. I can have a look at your suggestion above and see if that solves the problem. david-arm: The reason for this is because at the time of parsing the string we have no idea what the…

TailFoldingKind TailFoldingKindLoc;		TailFoldingOption TailFoldingOptionLoc;

		namespace llvm {
		void setSVETailFoldingDefaultOpts(uint8_t V) {
		TailFoldingOptionLoc.setDefault(V);
		dmgreenUnsubmitted Done Reply Inline Actions Is this setting a global variable? I think it should be just a field in the subtarget (maybe a subtarget feature), that is potentially overridden by the option if it is present. dmgreen: Is this setting a global variable? I think it should be just a field in the subtarget (maybe a…
		}
		} // namespace llvm

cl::opt<TailFoldingKind, true, cl::parser<std::string>> SVETailFolding(		cl::opt<TailFoldingOption, true, cl::parser<std::string>> SVETailFolding(
"sve-tail-folding",		"sve-tail-folding",
cl::desc(		cl::desc(
"Control the use of vectorisation using tail-folding for SVE:"		"Control the use of vectorisation using tail-folding for SVE:"
"\ndisabled No loop types will vectorize using tail-folding"		"\ndisabled No loop types will vectorize using tail-folding"
"\ndefault Uses the default tail-folding settings for the target "		"\ndefault Uses the default tail-folding settings for the target "
"CPU"		"CPU"
"\nall All legal loop types will vectorize using tail-folding"		"\nall All legal loop types will vectorize using tail-folding"
"\nsimple Use tail-folding for simple loops (not reductions or "		"\nsimple Use tail-folding for simple loops (not reductions or "
"recurrences)"		"recurrences)"
"\nreductions Use tail-folding for loops containing reductions"		"\nreductions Use tail-folding for loops containing reductions"
"\nrecurrences Use tail-folding for loops containing first order "		"\nrecurrences Use tail-folding for loops containing first order "
"recurrences"),		"recurrences"),
cl::location(TailFoldingKindLoc));		cl::location(TailFoldingOptionLoc));

bool AArch64TTIImpl::areInlineCompatible(const Function *Caller,		bool AArch64TTIImpl::areInlineCompatible(const Function *Caller,
const Function *Callee) const {		const Function *Callee) const {
const TargetMachine &TM = getTLI()->getTargetMachine();		const TargetMachine &TM = getTLI()->getTargetMachine();

const FeatureBitset &CallerBits =		const FeatureBitset &CallerBits =
TM.getSubtargetImpl(*Caller)->getFeatureBits();		TM.getSubtargetImpl(*Caller)->getFeatureBits();
const FeatureBitset &CalleeBits =		const FeatureBitset &CalleeBits =
▲ Show 20 Lines • Show All 2,912 Lines • ▼ Show 20 Lines	InstructionCost AArch64TTIImpl::getShuffleCost(TTI::ShuffleKind Kind,

return BaseT::getShuffleCost(Kind, Tp, Mask, Index, SubTp);		return BaseT::getShuffleCost(Kind, Tp, Mask, Index, SubTp);
}		}

bool AArch64TTIImpl::preferPredicateOverEpilogue(		bool AArch64TTIImpl::preferPredicateOverEpilogue(
Loop L, LoopInfo LI, ScalarEvolution &SE, AssumptionCache &AC,		Loop L, LoopInfo LI, ScalarEvolution &SE, AssumptionCache &AC,
TargetLibraryInfo TLI, DominatorTree DT, LoopVectorizationLegality *LVL,		TargetLibraryInfo TLI, DominatorTree DT, LoopVectorizationLegality *LVL,
InterleavedAccessInfo *IAI) {		InterleavedAccessInfo *IAI) {
if (!ST->hasSVE() \|\| TailFoldingKindLoc == TailFoldingKind::TFDisabled)		if (!ST->hasSVE())
return false;		return false;

// We don't currently support vectorisation with interleaving for SVE - with		// We don't currently support vectorisation with interleaving for SVE - with
// such loops we're better off not using tail-folding. This gives us a chance		// such loops we're better off not using tail-folding. This gives us a chance
// to fall back on fixed-width vectorisation using NEON's ld2/st2/etc.		// to fall back on fixed-width vectorisation using NEON's ld2/st2/etc.
if (IAI->hasGroups())		if (IAI->hasGroups())
return false;		return false;

TailFoldingKind Required; // Defaults to 0.		TailFoldingKind Required(0);
if (LVL->getReductionVars().size())		if (LVL->getReductionVars().size())
		sdesmalenUnsubmitted Done Reply Inline Actions This seems redundant given the line above that says 'Defaults to 0'. How about creating a constructor for it that takes a TailFoldingOpts as operand, so that you can write `TailFoldingKind Required(TFDisabled)`? sdesmalen: This seems redundant given the line above that says 'Defaults to 0'. How about creating a…
Required.add(TailFoldingKind::TFReductions);		Required.add(TFReductions);
if (LVL->getFirstOrderRecurrences().size())		if (LVL->getFirstOrderRecurrences().size())
Required.add(TailFoldingKind::TFRecurrences);		Required.add(TFRecurrences);
if (!Required)		if (!Required)
Required.add(TailFoldingKind::TFSimple);		Required.add(TFSimple);

return (TailFoldingKindLoc & Required) == Required;		return TailFoldingOptionLoc.satisfies(Required);
		sdesmalenUnsubmitted Done Reply Inline Actions Is it worth creating a method for this, so that you don't have to expose the bit-field to users, e.g. return TailFoldingKindLoc.satisfies(Required); ? sdesmalen: Is it worth creating a method for this, so that you don't have to expose the bit-field to users…
		david-armAuthorUnsubmitted Done Reply Inline Actions Hi @dmgreen, I've tried to address your concerns about using tail-folding for very tight loops here. It's a little crude, but I've introduced an instruction threshold, below which we don't tail-fold. For the godbolt example you gave we won't use tail-folding, but for the SPEC2017 benchmarks (and many others) we will. david-arm: Hi @dmgreen, I've tried to address your concerns about using tail-folding for very tight loops…
}		}

llvm/lib/Target/AArch64/Utils/AArch64BaseInfo.h

Show First 20 Lines • Show All 505 Lines • ▼ Show 20 Lines	case 64:
return AArch64SVEPredPattern::vl64;		return AArch64SVEPredPattern::vl64;
case 128:		case 128:
return AArch64SVEPredPattern::vl128;		return AArch64SVEPredPattern::vl128;
case 256:		case 256:
return AArch64SVEPredPattern::vl256;		return AArch64SVEPredPattern::vl256;
}		}
}		}

		enum TailFoldingOpts : uint8_t {
		TFDisabled = 0x0,
		TFReductions = 0x01,
		TFRecurrences = 0x02,
		TFSimple = 0x80,
		TFAll = TFReductions \| TFRecurrences \| TFSimple
		sdesmalenUnsubmitted Not Done Reply Inline Actions I know you're just moving this code around, but could you add a description of what 'simple' means? And maybe also add a description for TFRecurrences (are those first-order recurrences that reqequire a splice operation?) (very minor nit: would it make sense to make this enum value `0x01` so that it's clear that anything not '1' is no longer a 'simple' loop?) sdesmalen: I know you're just moving this code around, but could you add a description of what 'simple'…
		};

		void setSVETailFoldingDefaultOpts(uint8_t);

namespace AArch64ExactFPImm {		namespace AArch64ExactFPImm {
struct ExactFPImm {		struct ExactFPImm {
const char *Name;		const char *Name;
int Enum;		int Enum;
const char *Repr;		const char *Repr;
};		};
#define GET_EXACTFPIMM_DECL		#define GET_EXACTFPIMM_DECL
#include "AArch64GenSystemOperands.inc"		#include "AArch64GenSystemOperands.inc"
▲ Show 20 Lines • Show All 246 Lines • Show Last 20 Lines

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

This file is larger than 256 KB, so syntax highlighting is disabled by default.

Show First 20 Lines • Show All 9,740 Lines • ▼ Show 20 Lines	static ScalarEpilogueLowering getScalarEpilogueLowering(
// 3) If set, obey the hints		// 3) If set, obey the hints
switch (Hints.getPredicate()) {		switch (Hints.getPredicate()) {
case LoopVectorizeHints::FK_Enabled:		case LoopVectorizeHints::FK_Enabled:
return CM_ScalarEpilogueNotNeededUsePredicate;		return CM_ScalarEpilogueNotNeededUsePredicate;
case LoopVectorizeHints::FK_Disabled:		case LoopVectorizeHints::FK_Disabled:
return CM_ScalarEpilogueAllowed;		return CM_ScalarEpilogueAllowed;
};		};

		// If we're forcing the use of epilogue vectorization we should honour that
		// instead of the TTI hook behaviour.
		if (EpilogueVectorizationForceVF.getNumOccurrences())
		return CM_ScalarEpilogueAllowed;

// 4) if the TTI hook indicates this is profitable, request predication.		// 4) if the TTI hook indicates this is profitable, request predication.
if (TTI->preferPredicateOverEpilogue(L, LI, SE, AC, TLI, DT, &LVL, IAI))		if (TTI->preferPredicateOverEpilogue(L, LI, SE, AC, TLI, DT, &LVL, IAI))
return CM_ScalarEpilogueNotNeededUsePredicate;		return CM_ScalarEpilogueNotNeededUsePredicate;

return CM_ScalarEpilogueAllowed;		return CM_ScalarEpilogueAllowed;
}		}

Value VPTransformState::get(VPValue Def, unsigned Part) {		Value VPTransformState::get(VPValue Def, unsigned Part) {
▲ Show 20 Lines • Show All 803 Lines • Show Last 20 Lines

llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding-option.ll

	; RUN: opt < %s -loop-vectorize -sve-tail-folding=disabled -S \| FileCheck %s -check-prefix=CHECK-NOTF			; RUN: opt < %s -loop-vectorize -sve-tail-folding=disabled -S \| FileCheck %s -check-prefix=CHECK-NOTF
	; RUN: opt < %s -loop-vectorize -sve-tail-folding=default -S \| FileCheck %s -check-prefix=CHECK-NOTF			; RUN: opt < %s -loop-vectorize -sve-tail-folding=default -S \| FileCheck %s -check-prefix=CHECK-TF
				; RUN: opt < %s -loop-vectorize -S \| FileCheck %s -check-prefix=CHECK-TF
	; RUN: opt < %s -loop-vectorize -sve-tail-folding=all -S \| FileCheck %s -check-prefix=CHECK-TF			; RUN: opt < %s -loop-vectorize -sve-tail-folding=all -S \| FileCheck %s -check-prefix=CHECK-TF
	; RUN: opt < %s -loop-vectorize -sve-tail-folding=disabled+simple+reductions+recurrences -S \| FileCheck %s -check-prefix=CHECK-TF			; RUN: opt < %s -loop-vectorize -sve-tail-folding=default+disabled+simple+reductions+recurrences -S \| FileCheck %s -check-prefix=CHECK-TF
				; RUN: opt < %s -loop-vectorize -S -mcpu=neoverse-v1 -sve-tail-folding=default+reductions+recurrences \| FileCheck %s -check-prefix=CHECK-TF
	; RUN: opt < %s -loop-vectorize -sve-tail-folding=all+noreductions -S \| FileCheck %s -check-prefix=CHECK-TF-NORED			; RUN: opt < %s -loop-vectorize -sve-tail-folding=all+noreductions -S \| FileCheck %s -check-prefix=CHECK-TF-NORED
	; RUN: opt < %s -loop-vectorize -sve-tail-folding=all+norecurrences -S \| FileCheck %s -check-prefix=CHECK-TF-NOREC			; RUN: opt < %s -loop-vectorize -sve-tail-folding=default+norecurrences -S \| FileCheck %s -check-prefix=CHECK-TF-NOREC
	; RUN: opt < %s -loop-vectorize -sve-tail-folding=reductions -S \| FileCheck %s -check-prefix=CHECK-TF-ONLYRED			; RUN: opt < %s -loop-vectorize -sve-tail-folding=reductions -S \| FileCheck %s -check-prefix=CHECK-TF-ONLYRED
				; RUN: opt < %s -loop-vectorize -S -sve-tail-folding=default -mcpu=neoverse-v1 \| FileCheck %s -check-prefix=CHECK-NEOVERSE-V1
				; RUN: opt < %s -loop-vectorize -S -mcpu=neoverse-v1 -sve-tail-folding=default \| FileCheck %s -check-prefix=CHECK-NEOVERSE-V1

	target triple = "aarch64-unknown-linux-gnu"			target triple = "aarch64-unknown-linux-gnu"

	define void @simple_memset(i32 %val, i32* %ptr, i64 %n) #0 {			define void @simple_memset(i32 %val, i32* %ptr, i64 %n) #0 {
	; CHECK-NOTF-LABEL: @simple_memset(			; CHECK-NOTF-LABEL: @simple_memset(
	; CHECK-NOTF: vector.ph:			; CHECK-NOTF: vector.ph:
	; CHECK-NOTF: %[[INSERT:.*]] = insertelement <vscale x 4 x i32> poison, i32 %val, i32 0			; CHECK-NOTF: %[[INSERT:.*]] = insertelement <vscale x 4 x i32> poison, i32 %val, i32 0
	; CHECK-NOTF: %[[SPLAT:.*]] = shufflevector <vscale x 4 x i32> %[[INSERT]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer			; CHECK-NOTF: %[[SPLAT:.*]] = shufflevector <vscale x 4 x i32> %[[INSERT]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
	Show All 28 Lines
	; CHECK-TF-ONLYRED-LABEL: @simple_memset(			; CHECK-TF-ONLYRED-LABEL: @simple_memset(
	; CHECK-TF-ONLYRED: vector.ph:			; CHECK-TF-ONLYRED: vector.ph:
	; CHECK-TF-ONLYRED: %[[INSERT:.*]] = insertelement <vscale x 4 x i32> poison, i32 %val, i32 0			; CHECK-TF-ONLYRED: %[[INSERT:.*]] = insertelement <vscale x 4 x i32> poison, i32 %val, i32 0
	; CHECK-TF-ONLYRED: %[[SPLAT:.*]] = shufflevector <vscale x 4 x i32> %[[INSERT]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer			; CHECK-TF-ONLYRED: %[[SPLAT:.*]] = shufflevector <vscale x 4 x i32> %[[INSERT]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
	; CHECK-TF-ONLYRED: vector.body:			; CHECK-TF-ONLYRED: vector.body:
	; CHECK-TF-ONLYRED-NOT: %{{.*}} = phi <vscale x 4 x i1>			; CHECK-TF-ONLYRED-NOT: %{{.*}} = phi <vscale x 4 x i1>
	; CHECK-TF-ONLYRED: store <vscale x 4 x i32> %[[SPLAT]], <vscale x 4 x i32>*			; CHECK-TF-ONLYRED: store <vscale x 4 x i32> %[[SPLAT]], <vscale x 4 x i32>*

				; CHECK-NEOVERSE-V1-LABEL: @simple_memset(
				; CHECK-NEOVERSE-V1: vector.ph:
				; CHECK-NEOVERSE-V1: %[[INSERT:.*]] = insertelement <vscale x 4 x i32> poison, i32 %val, i32 0
				; CHECK-NEOVERSE-V1: %[[SPLAT:.*]] = shufflevector <vscale x 4 x i32> %[[INSERT]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
				; CHECK-NEOVERSE-V1: vector.body:
				; CHECK-NEOVERSE-V1: %[[ACTIVE_LANE_MASK:.*]] = phi <vscale x 4 x i1>
				; CHECK-NEOVERSE-V1: call void @llvm.masked.store.nxv4i32.p0nxv4i32(<vscale x 4 x i32> %[[SPLAT]], {{.*}} %[[ACTIVE_LANE_MASK]]

	entry:			entry:
	br label %while.body			br label %while.body

	while.body: ; preds = %while.body, %entry			while.body: ; preds = %while.body, %entry
	%index = phi i64 [ %index.next, %while.body ], [ 0, %entry ]			%index = phi i64 [ %index.next, %while.body ], [ 0, %entry ]
	%gep = getelementptr i32, i32* %ptr, i64 %index			%gep = getelementptr i32, i32* %ptr, i64 %index
	store i32 %val, i32* %gep			store i32 %val, i32* %gep
	%index.next = add nsw i64 %index, 1			%index.next = add nsw i64 %index, 1
	▲ Show 20 Lines • Show All 45 Lines • ▼ Show 20 Lines
	; CHECK-TF-ONLYRED: vector.body:			; CHECK-TF-ONLYRED: vector.body:
	; CHECK-TF-ONLYRED: %[[ACTIVE_LANE_MASK:.*]] = phi <vscale x 4 x i1>			; CHECK-TF-ONLYRED: %[[ACTIVE_LANE_MASK:.*]] = phi <vscale x 4 x i1>
	; CHECK-TF-ONLYRED: %[[VEC_PHI:.*]] = phi <vscale x 4 x float>			; CHECK-TF-ONLYRED: %[[VEC_PHI:.*]] = phi <vscale x 4 x float>
	; CHECK-TF-ONLYRED: %[[LOAD:.]] = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0nxv4f32(<vscale x 4 x float>{{.}} %[[ACTIVE_LANE_MASK]]			; CHECK-TF-ONLYRED: %[[LOAD:.]] = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0nxv4f32(<vscale x 4 x float>{{.}} %[[ACTIVE_LANE_MASK]]
	; CHECK-TF-ONLYRED: %[[ADD:.*]] = fadd fast <vscale x 4 x float> %[[LOAD]]			; CHECK-TF-ONLYRED: %[[ADD:.*]] = fadd fast <vscale x 4 x float> %[[LOAD]]
	; CHECK-TF-ONLYRED: %[[SEL:.*]] = select fast <vscale x 4 x i1> %[[ACTIVE_LANE_MASK]], <vscale x 4 x float> %[[ADD]], <vscale x 4 x float> %[[VEC_PHI]]			; CHECK-TF-ONLYRED: %[[SEL:.*]] = select fast <vscale x 4 x i1> %[[ACTIVE_LANE_MASK]], <vscale x 4 x float> %[[ADD]], <vscale x 4 x float> %[[VEC_PHI]]
	; CHECK-TF-ONLYRED: middle.block:			; CHECK-TF-ONLYRED: middle.block:
	; CHECK-TF-ONLYRED-NEXT: call fast float @llvm.vector.reduce.fadd.nxv4f32(float -0.000000e+00, <vscale x 4 x float> %[[SEL]])			; CHECK-TF-ONLYRED-NEXT: call fast float @llvm.vector.reduce.fadd.nxv4f32(float -0.000000e+00, <vscale x 4 x float> %[[SEL]])

				; CHECK-NEOVERSE-V1-LABEL: @fadd_red_fast
				; CHECK-NEOVERSE-V1: vector.body:
				; CHECK-NEOVERSE-V1-NOT: %{{.*}} = phi <vscale x 4 x i1>
				; CHECK-NEOVERSE-V1: %[[LOAD:.*]] = load <vscale x 4 x float>
				; CHECK-NEOVERSE-V1: %[[ADD:.*]] = fadd fast <vscale x 4 x float> %[[LOAD]]
				; CHECK-NEOVERSE-V1: middle.block:
				; CHECK-NEOVERSE-V1-NEXT: call fast float @llvm.vector.reduce.fadd.nxv4f32(float -0.000000e+00, <vscale x 4 x float> %[[ADD]])

	entry:			entry:
	br label %for.body			br label %for.body

	for.body:			for.body:
	%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]			%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
	%sum.07 = phi float [ 0.000000e+00, %entry ], [ %add, %for.body ]			%sum.07 = phi float [ 0.000000e+00, %entry ], [ %add, %for.body ]
	%arrayidx = getelementptr inbounds float, float* %a, i64 %iv			%arrayidx = getelementptr inbounds float, float* %a, i64 %iv
	%0 = load float, float* %arrayidx, align 4			%0 = load float, float* %arrayidx, align 4
	▲ Show 20 Lines • Show All 67 Lines • ▼ Show 20 Lines
	; CHECK-TF-ONLYRED: vector.body:			; CHECK-TF-ONLYRED: vector.body:
	; CHECK-TF-ONLYRED-NOT: %{{.*}} = phi <vscale x 4 x i1>			; CHECK-TF-ONLYRED-NOT: %{{.*}} = phi <vscale x 4 x i1>
	; CHECK-TF-ONLYRED: %[[VECTOR_RECUR:.]] = phi <vscale x 4 x i32> [ %[[RECUR_INIT]], %vector.ph ], [ %[[LOAD:.]], %vector.body ]			; CHECK-TF-ONLYRED: %[[VECTOR_RECUR:.]] = phi <vscale x 4 x i32> [ %[[RECUR_INIT]], %vector.ph ], [ %[[LOAD:.]], %vector.body ]
	; CHECK-TF-ONLYRED: %[[LOAD]] = load <vscale x 4 x i32>			; CHECK-TF-ONLYRED: %[[LOAD]] = load <vscale x 4 x i32>
	; CHECK-TF-ONLYRED: %[[SPLICE:.*]] = call <vscale x 4 x i32> @llvm.experimental.vector.splice.nxv4i32(<vscale x 4 x i32> %[[VECTOR_RECUR]], <vscale x 4 x i32> %[[LOAD]], i32 -1)			; CHECK-TF-ONLYRED: %[[SPLICE:.*]] = call <vscale x 4 x i32> @llvm.experimental.vector.splice.nxv4i32(<vscale x 4 x i32> %[[VECTOR_RECUR]], <vscale x 4 x i32> %[[LOAD]], i32 -1)
	; CHECK-TF-ONLYRED: %[[ADD:.*]] = add nsw <vscale x 4 x i32> %[[LOAD]], %[[SPLICE]]			; CHECK-TF-ONLYRED: %[[ADD:.*]] = add nsw <vscale x 4 x i32> %[[LOAD]], %[[SPLICE]]
	; CHECK-TF-ONLYRED: store <vscale x 4 x i32> %[[ADD]]			; CHECK-TF-ONLYRED: store <vscale x 4 x i32> %[[ADD]]

				; CHECK-NEOVERSE-V1-LABEL: @add_recur
				; CHECK-NEOVERSE-V1: entry:
				; CHECK-NEOVERSE-V1: %[[PRE:.]] = load i32, i32 %src, align 4
				; CHECK-NEOVERSE-V1: vector.ph:
				; CHECK-NEOVERSE-V1: %[[RECUR_INIT:.*]] = insertelement <vscale x 4 x i32> poison, i32 %[[PRE]]
				; CHECK-NEOVERSE-V1: vector.body:
				; CHECK-NEOVERSE-V1-NOT: %{{.*}} = phi <vscale x 4 x i1>
				; CHECK-NEOVERSE-V1: %[[VECTOR_RECUR:.]] = phi <vscale x 4 x i32> [ %[[RECUR_INIT]], %vector.ph ], [ %[[LOAD:.]], %vector.body ]
				; CHECK-NEOVERSE-V1: %[[LOAD]] = load <vscale x 4 x i32>
				; CHECK-NEOVERSE-V1: %[[SPLICE:.*]] = call <vscale x 4 x i32> @llvm.experimental.vector.splice.nxv4i32(<vscale x 4 x i32> %[[VECTOR_RECUR]], <vscale x 4 x i32> %[[LOAD]], i32 -1)
				; CHECK-NEOVERSE-V1: %[[ADD:.*]] = add nsw <vscale x 4 x i32> %[[LOAD]], %[[SPLICE]]
				; CHECK-NEOVERSE-V1: store <vscale x 4 x i32> %[[ADD]]

	entry:			entry:
	%.pre = load i32, i32* %src, align 4			%.pre = load i32, i32* %src, align 4
	br label %for.body			br label %for.body

	for.body: ; preds = %entry, %for.body			for.body: ; preds = %entry, %for.body
	%0 = phi i32 [ %1, %for.body ], [ %.pre, %entry ]			%0 = phi i32 [ %1, %for.body ], [ %.pre, %entry ]
	%i.010 = phi i64 [ %add, %for.body ], [ 0, %entry ]			%i.010 = phi i64 [ %add, %for.body ], [ 0, %entry ]
	%add = add nuw nsw i64 %i.010, 1			%add = add nuw nsw i64 %i.010, 1
	Show All 29 Lines
	; CHECK-TF-NORED: %{{.*}} = shufflevector <8 x float> %[[LOAD]], <8 x float> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>			; CHECK-TF-NORED: %{{.*}} = shufflevector <8 x float> %[[LOAD]], <8 x float> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>

	; CHECK-TF-NOREC-LABEL: @interleave(			; CHECK-TF-NOREC-LABEL: @interleave(
	; CHECK-TF-NOREC: vector.body:			; CHECK-TF-NOREC: vector.body:
	; CHECK-TF-NOREC: %[[LOAD:.*]] = load <8 x float>, <8 x float>			; CHECK-TF-NOREC: %[[LOAD:.*]] = load <8 x float>, <8 x float>
	; CHECK-TF-NOREC: %{{.*}} = shufflevector <8 x float> %[[LOAD]], <8 x float> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>			; CHECK-TF-NOREC: %{{.*}} = shufflevector <8 x float> %[[LOAD]], <8 x float> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
	; CHECK-TF-NOREC: %{{.*}} = shufflevector <8 x float> %[[LOAD]], <8 x float> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>			; CHECK-TF-NOREC: %{{.*}} = shufflevector <8 x float> %[[LOAD]], <8 x float> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>

				; CHECK-NEOVERSE-V1-LABEL: @interleave(
				; CHECK-NEOVERSE-V1: vector.body:
				; CHECK-NEOVERSE-V1: %[[LOAD:.*]] = load <8 x float>, <8 x float>
				; CHECK-NEOVERSE-V1: %{{.*}} = shufflevector <8 x float> %[[LOAD]], <8 x float> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
				; CHECK-NEOVERSE-V1: %{{.*}} = shufflevector <8 x float> %[[LOAD]], <8 x float> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>

	entry:			entry:
	br label %for.body			br label %for.body

	for.body: ; preds = %entry, %for.body			for.body: ; preds = %entry, %for.body
	%i.021 = phi i64 [ %inc, %for.body ], [ 0, %entry ]			%i.021 = phi i64 [ %inc, %for.body ], [ 0, %entry ]
	%mul = shl nuw nsw i64 %i.021, 1			%mul = shl nuw nsw i64 %i.021, 1
	%arrayidx = getelementptr inbounds float, float* %src, i64 %mul			%arrayidx = getelementptr inbounds float, float* %src, i64 %mul
	%0 = load float, float* %arrayidx, align 4			%0 = load float, float* %arrayidx, align 4
	Show All 27 Lines