This patch enables tail-folding for SVE. As you know, tail-folding has great potential to improve codegen: we no longer have to emit a vector loop plus a scalar epilogue loop, the runtime checks that choose between them, or some of the setup code for the vector loop. This can help performance significantly in some cases.
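For readers less familiar with the transformation, here is a minimal scalar sketch in C of the two loop shapes. It is only an emulation, not what the vectorizer emits: `VL`, `add_with_epilogue` and `add_tail_folded` are hypothetical names, and `VL` is fixed at 4 for illustration, whereas the SVE vector length is only known at run time. The per-lane `active` flag stands in for the predicate that an SVE instruction such as WHILELO would produce.

```c
#include <stddef.h>

/* Hypothetical fixed vector length in elements, for illustration only. */
#define VL 4

/* Without tail-folding: a main "vector" loop over full chunks of VL
 * elements, followed by a scalar epilogue for the remaining n % VL. */
void add_with_epilogue(int *dst, const int *a, const int *b, size_t n) {
    size_t i = 0;
    for (; i + VL <= n; i += VL)
        for (size_t lane = 0; lane < VL; ++lane)
            dst[i + lane] = a[i + lane] + b[i + lane];
    for (; i < n; ++i) /* scalar epilogue */
        dst[i] = a[i] + b[i];
}

/* With tail-folding: a single loop in which every lane is guarded by a
 * predicate, so the final partial chunk needs no separate epilogue. */
void add_tail_folded(int *dst, const int *a, const int *b, size_t n) {
    for (size_t i = 0; i < n; i += VL)
        for (size_t lane = 0; lane < VL; ++lane) {
            int active = (i + lane) < n; /* emulated per-lane predicate */
            if (active)
                dst[i + lane] = a[i + lane] + b[i + lane];
        }
}
```

Both functions compute the same result; the second one simply has one loop and no remainder handling, which is where the code-size and setup savings come from.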
I have added WIP (work-in-progress) to the subject because I am still collecting performance numbers and wanted to get your input in the meantime.
My results so far on a 2x256b SVE implementation:
- 5% uplift for x264 (SPEC INT 2017).
- Neutral for the other benchmarks in SPEC INT 2017.
- 1% uplift on an embedded benchmark. It's not a very representative workload, but it has a few matrix kernels, and 1% is significant for that benchmark, so it nicely illustrates the benefits of tail-folding.
- I've tried the LLVM test-suite, but just generating a baseline shows that it's really noisy. I haven't yet tried it with tail-folding, because I am not sure I could conclude anything from the numbers.
Next, I will collect numbers for SPEC FP 2017.
This change enables only the "simple" form of tail-folding, so it does not yet deal with, for example, reductions or recurrences. This seemed like a good first step to me while we gain more experience with this. I am interested to hear if you have suggestions for workloads or cases that I should check.
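To make the distinction concrete, here is a rough sketch in C of the three loop shapes mentioned above. The function names are hypothetical and not taken from the patch or from any benchmark, and exactly which loops get tail-folded is decided by the patch itself, not by this illustration.

```c
#include <stddef.h>

/* "Simple" loop: every iteration is independent of the others. */
void scale(float *dst, const float *src, float k, size_t n) {
    for (size_t i = 0; i < n; ++i)
        dst[i] = k * src[i];
}

/* Reduction: all iterations accumulate into one scalar value. */
int sum(const int *src, size_t n) {
    int acc = 0;
    for (size_t i = 0; i < n; ++i)
        acc += src[i];
    return acc;
}

/* First-order recurrence: each iteration uses a value that was
 * produced by the previous iteration (here, prev). */
void adjacent_diff(int *dst, const int *src, size_t n) {
    int prev = 0;
    for (size_t i = 0; i < n; ++i) {
        dst[i] = src[i] - prev;
        prev = src[i];
    }
}
```

Only loops of the first kind carry no value from one iteration to the next, which is what makes them the easiest starting point for predication.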