This is an archive of the discontinued LLVM Phabricator instance.

[AArch64][SVE] Enable Tail-Folding. WIP

Authored by SjoerdMeijer on Nov 21 2022, 5:39 AM.



This patch enables tail-folding for SVE. As you know, tail-folding has great potential to improve codegen: we no longer have to emit a vector loop plus an epilogue loop, the runtime checks that go with them, or some of the setup code for the vector loop. This can help performance significantly in some cases.
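To make the difference concrete, here is a minimal scalar sketch of the two loop shapes. It is not the vectoriser's output. The fixed lane count `kLanes` and both function names are hypothetical, and the per-lane `active` test only models what a predicate-generating instruction such as SVE's `whilelo` would provide.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical fixed lane count; real SVE vector lengths are only known at runtime.
constexpr std::size_t kLanes = 4;

// Shape 1: unpredicated vector body plus a scalar epilogue for the remainder.
void add_with_epilogue(const int *a, const int *b, int *out, std::size_t n) {
  std::size_t i = 0;
  for (; i + kLanes <= n; i += kLanes)
    for (std::size_t l = 0; l < kLanes; ++l) // models one full-width vector op
      out[i + l] = a[i + l] + b[i + l];
  for (; i < n; ++i) // epilogue handles the leftover n % kLanes elements
    out[i] = a[i] + b[i];
}

// Shape 2: tail-folded loop; the last iteration runs with some lanes inactive,
// so no separate epilogue loop is needed.
void add_tail_folded(const int *a, const int *b, int *out, std::size_t n) {
  for (std::size_t i = 0; i < n; i += kLanes) {
    for (std::size_t l = 0; l < kLanes; ++l) {
      const bool active = (i + l) < n; // models the per-lane predicate
      if (active)
        out[i + l] = a[i + l] + b[i + l];
    }
  }
}
```

Both functions compute the same result; the tail-folded shape trades the extra loop and its checks for predication on every iteration.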

I have added WIP (work-in-progress) to the subject as I am looking into collecting some more performance numbers and wanted to get your input while I am doing that.

My results so far on a 2x256b SVE implementation:

  • 5% uplift for X264 (SPEC INT 2017)
  • Neutral for the other apps in SPEC INT 2017.
  • 1% uplift on an embedded benchmark. It's not a very representative workload, but it has a few matrix kernels and this 1% is significant for that benchmark, nicely illustrating benefits of tail-folding.
  • I've tried the llvm test-suite, but just trying to generate a baseline shows it's really noisy. I haven't yet tried with tail-folding, because I am not sure I can conclude anything from the numbers.

Next, I will collect numbers for SPEC FP 2017.

This change enables only the "simple" form of tail-folding, so it does not handle, for example, reductions or recurrences. This seemed like a good first step to me while we gain more experience with it. I am interested to hear if you have suggestions for workloads or cases that I should check.

Event Timeline

SjoerdMeijer created this revision. Nov 21 2022, 5:39 AM
SjoerdMeijer requested review of this revision. Nov 21 2022, 5:39 AM
Herald added a project: Restricted Project. Nov 21 2022, 5:39 AM

Hi @SjoerdMeijer, thanks for looking into this. I do actually already have a patch to enable this by default, where the default behaviour is tuned according to the CPU. I think this is what we want because the profile will change according to what CPU you're running on - some CPUs may handle reductions better than others. The decision in this patch may be incorrect for 128-bit vector implementations. I also ran SPEC2k17 on an SVE-enabled CPU and I remember seeing a small (2-3%) regression in parest or something like that, which is one of the reasons I didn't push the patch any further. I also think it's really important to run a much larger set of benchmarks besides SPEC2k17 and collect numbers to show the benefits, since there isn't much vectorisation actually going on in SPEC2k17.

One of the major problems with the current tail-folding implementation is that we make the decision before doing any cost analysis in the vectoriser. That isn't great, because we may be forcing the vectoriser down different code paths than it would take if we didn't tail-fold. Ideally, we want to move to a model where the vectoriser has a two-dimensional matrix of costs covering each combination of VF and vectorisation style (e.g. tail-folding vs whole vector loops, etc.), and chooses the best combination.
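A rough sketch of that selection step, assuming the costs have already been computed: the `Style` enum, the `Candidate` struct and `pickCheapest` are hypothetical names for illustration and are not part of the LoopVectorizer.

```cpp
#include <algorithm>
#include <optional>
#include <vector>

// Hypothetical vectorisation styles; a real model might have more.
enum class Style { WholeVectorWithEpilogue, TailFolded };

// One cell of the (VF, style) cost matrix. The cost is assumed to come from
// some per-target cost model that is not shown here.
struct Candidate {
  unsigned VF;
  Style S;
  double Cost;
};

// Return the cheapest (VF, style) combination, or nothing if the matrix is
// empty. Ties resolve to the earliest candidate in the list.
std::optional<Candidate> pickCheapest(const std::vector<Candidate> &Matrix) {
  if (Matrix.empty())
    return std::nullopt;
  return *std::min_element(
      Matrix.begin(), Matrix.end(),
      [](const Candidate &A, const Candidate &B) { return A.Cost < B.Cost; });
}
```

The point of the design is that the tail-folding decision becomes one axis of the search rather than a choice fixed before any costs are known.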

Ah, I didn't know about D130618!
I will reply on that ticket; it's probably best to keep things in one place.

Matt added a subscriber: Matt. Nov 21 2022, 3:09 PM
SjoerdMeijer abandoned this revision. Mar 17 2023, 1:39 AM