Page MenuHomePhabricator

[LoopUnroll] Don't unroll before vectorisation
Needs RevisionPublic

Authored by SjoerdMeijer on May 19 2021, 1:19 AM.

Details

Summary

The loop vectoriser is sandwiched between two loopunroll invocations in the optimisation pipeline and this removes the first one. The motivation is that (fully) unrolling loops early removes opportunities for the loop vectoriser. This is often the case for loops with constant loop bounds and relatively small iteration counts for which vectorisation is still very much profitable. After fully unrolling these sort of loops, the SLP vectoriser is not always able to compensate for this, or is not (yet) as effective as the loop vectoriser. Therefore, first performing loop vectorisation, unrolling, SLP vectorisation seems a better approach.

There are a quite a few of these cases in x264 in SPEC, like this one which GCC loop vectorises and we don't which is the reason why we are behind quite a lot:

for( int i = 0; i < 16; i++ )
  if ((dct[i]) > 0 )
    dct[i] = (bias[i] + (dct[i])) * (mf[i]) >> 16;
  else
    dct[i] = - ((bias[i] - (dct[i])) * (mf[i]) >> 16);

But this is also a bit of an old problem, and at least the following PRs are related: PR47178, PR47726, PR47554, PR47436, PR31572, PR47553, PR47491.

Some first performance numbers with patch, where + is a performance improvement and - is a regression:

AArch64 (neoverse-n1)
500.perlbench_r +0.34%
502.gcc_r i     -0.28%
505.mcf_r       -0.60%
520.omnetpp_r   +0.585
523.xalancbmk_r +1.68%
525.x264_r      +1.33%
531.deepsjeng_r +0.29%
541.leela_r     -0.54%
557.xz_r         0.00%

And for some embedded benchmarks:

Thumb2 (Cortex-M55)
benchmark1  -0.21%
benchmark2    +0.06%
DSP      +0.02%

These numbers show and improvement where I would like to see it: x264. The uplift in xalancbmk is nice too, but I haven't analysed that one yet. The other numbers are show a little bit of up and down behaviour, but only very small, and overall cancelling out each other. I think these are really encouraging results, because it suggests we get the results where we want without any fallout. This picture was confirmed on a set of embedded benchmarks.

I am not really a fan of the llvm test suite as a performance benchmark (noisy), but will get some numbers for that too. And while I do that, and fix up a few llvm regression test cases (the ones that check optimisation pipeline order), I already wanted to share this to get some opinions on this.

Diff Detail

Unit TestsFailed

TimeTest
270 msx64 debian > Clang.CodeGen::pgo-sample-thinlto-summary.c
Script: -- : 'RUN: at line 1'; /var/lib/buildkite-agent/builds/llvm-project/build/bin/clang -cc1 -internal-isystem /var/lib/buildkite-agent/builds/llvm-project/build/lib/clang/13.0.0/include -nostdsysteminc -O2 -fprofile-sample-use=/var/lib/buildkite-agent/builds/llvm-project/clang/test/CodeGen/Inputs/pgo-sample-thinlto-summary.prof /var/lib/buildkite-agent/builds/llvm-project/clang/test/CodeGen/pgo-sample-thinlto-summary.c -emit-llvm -o - 2>&1 | /var/lib/buildkite-agent/builds/llvm-project/build/bin/FileCheck /var/lib/buildkite-agent/builds/llvm-project/clang/test/CodeGen/pgo-sample-thinlto-summary.c -check-prefix=SAMPLEPGO
80 msx64 debian > LLVM.CodeGen/AMDGPU::extra-sroa-after-unroll.ll
Script: -- : 'RUN: at line 1'; /var/lib/buildkite-agent/builds/llvm-project/build/bin/opt -instnamer -O1 -mtriple=amdgcn-- -S -o - /var/lib/buildkite-agent/builds/llvm-project/llvm/test/CodeGen/AMDGPU/extra-sroa-after-unroll.ll | /var/lib/buildkite-agent/builds/llvm-project/build/bin/FileCheck -check-prefixes=GCN,O1 /var/lib/buildkite-agent/builds/llvm-project/llvm/test/CodeGen/AMDGPU/extra-sroa-after-unroll.ll
130 msx64 debian > LLVM.Other::new-pm-defaults.ll
Script: -- : 'RUN: at line 10'; /var/lib/buildkite-agent/builds/llvm-project/build/bin/opt -disable-verify -verify-cfg-preserved=1 -debug-pass-manager -passes='default<O1>' -S /var/lib/buildkite-agent/builds/llvm-project/llvm/test/Other/new-pm-defaults.ll 2>&1 | /var/lib/buildkite-agent/builds/llvm-project/build/bin/FileCheck /var/lib/buildkite-agent/builds/llvm-project/llvm/test/Other/new-pm-defaults.ll --check-prefixes=CHECK-O,CHECK-DEFAULT,CHECK-O1,CHECK-NOEXT
100 msx64 debian > LLVM.Other::new-pm-thinlto-defaults.ll
Script: -- : 'RUN: at line 11'; /var/lib/buildkite-agent/builds/llvm-project/build/bin/opt -disable-verify -verify-cfg-preserved=0 -debug-pass-manager -passes='thinlto-pre-link<O1>' -S /var/lib/buildkite-agent/builds/llvm-project/llvm/test/Other/new-pm-thinlto-defaults.ll 2>&1 | /var/lib/buildkite-agent/builds/llvm-project/build/bin/FileCheck /var/lib/buildkite-agent/builds/llvm-project/llvm/test/Other/new-pm-thinlto-defaults.ll --check-prefixes=CHECK-O,CHECK-O1,CHECK-PRELINK-O,CHECK-PRELINK-O-NODIS
110 msx64 debian > LLVM.Other::new-pm-thinlto-postlink-pgo-defaults.ll
Script: -- : 'RUN: at line 4'; /var/lib/buildkite-agent/builds/llvm-project/build/bin/opt -disable-verify -verify-cfg-preserved=0 -debug-pass-manager -passes='thinlto<O1>' -S /var/lib/buildkite-agent/builds/llvm-project/llvm/test/Other/new-pm-thinlto-postlink-pgo-defaults.ll 2>&1 | /var/lib/buildkite-agent/builds/llvm-project/build/bin/FileCheck /var/lib/buildkite-agent/builds/llvm-project/llvm/test/Other/new-pm-thinlto-postlink-pgo-defaults.ll --check-prefixes=CHECK-O,CHECK-O1,CHECK-NOEXT
View Full Test Results (14 Failed)

Event Timeline

SjoerdMeijer created this revision.May 19 2021, 1:19 AM
SjoerdMeijer requested review of this revision.May 19 2021, 1:19 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 19 2021, 1:19 AM
SjoerdMeijer edited the summary of this revision. (Show Details)May 19 2021, 1:20 AM
lebedev.ri added a subscriber: lebedev.ri.

Thank you for looking into this!
This was bothering me for some time now, i'm also unsure if the current behavior is optimal.

You need to update pass structure tests, and it would be really good to add a phaseordering test.

nikic requested changes to this revision.May 19 2021, 1:28 AM
nikic added a subscriber: nikic.

The full unroll pass during function simplification is quite important for certain code patterns and serves a completely different purpose than the runtime unrolling that happens late in the pipeline. It is critical that full unrolling happens relatively early, so that scalar optimizations have a chance to work on the fully unrolled loop.

This revision now requires changes to proceed.May 19 2021, 1:28 AM

The full unroll pass during function simplification is quite important for certain code patterns and serves a completely different purpose than the runtime unrolling that happens late in the pipeline. It is critical that full unrolling happens relatively early, so that scalar optimizations have a chance to work on the fully unrolled loop.

This is a phase ordering issue, which is a difficult problem. :)

My numbers show that this is not a problem:

It is critical that full unrolling happens relatively early, so that scalar optimizations have a chance to work on the fully unrolled loop.

At least, for the codes that I have looked at. So the question is, do you have examples I can look at? Perhaps we can or need to reshuffle the pipeline a bit more after vectorisation? You've requested changes to this patch, but some suggestions how to move this forward would be appreciated.

As somewhat related data point, i'm currently looking at a case where
we completely unroll a loop that should have been later recognized by LoopIdiom pass.

nikic added a comment.May 19 2021, 1:49 AM

@SjoerdMeijer I expect you will find some test cases when you update phase ordering tests. A typical pattern is that a loop is working on a local array, gets fully unrolled, the array gets SROAd into individual array elements, mem2reg'd and then folds down to something much more efficient -- as if you didn't have a loop in the first place, and wrote out the unrolled code.

I occasionally see optimization failures in Rust which are caused by failure to fully unroll a loop at this pipeline position (because of weaknesses in unrolling). If the loop only gets unrolled in the late optimization pipeline, then that usually leaves behind "trivially optimizable" code. In some cases DAGCombine manages to pick up the slack (modulo dead stack allocations), but that's certainly not how it should work.

I don't have suggestion on how to move forward *this* patch, because I think it's fundamentally the wrong direction to take.

As somewhat related data point, i'm currently looking at a case where we completely unroll a loop that should have been later recognized by LoopIdiom pass.

LoopIdiom runs before full unrolling, so that seems odd?

SjoerdMeijer edited the summary of this revision. (Show Details)May 19 2021, 1:49 AM

I don't have suggestion on how to move forward *this* patch, because I think it's fundamentally the wrong direction to take.

In ideal world, SLP Vectorizer should be smarter :)

Possible fix for many SLP failures is https://reviews.llvm.org/D28907, but well yeah, stalled work from 2017.

So it depends, if this patch (and decision) give us more improvements than regressions, we should go for it..

A typical pattern is that a loop is working on a local array, gets fully unrolled, the array gets SROAd into individual array elements, mem2reg'd and then folds down to something much more efficient -- as if you didn't have a loop in the first place, and wrote out the unrolled code.

Could we check if loop is vectorizable, if so, avoid early loop unrolling?

Instead of removing early full unroll pass (see @nikic's concerns), couldn't we run loop vectorizer before full unroll pass?

I don't have suggestion on how to move forward *this* patch, because I think it's fundamentally the wrong direction to take.

@nikic : Thanks, this is one reason why I wanted early feedback on this.

Like I wrote in the description of this patch, I believe unrolling before vectorisation is fundamentally the wrong approach. This seems to be supported by current numbers, but I appreciate we are talking about a handful of benchmarks. This, I think, also relies on the loop vectoriser which seems more powerful than SLP vectorisation currently. And like I wrote in a previous comment, now I am very much interested in counter examples.

Instead of removing early full unroll pass (see @nikic's concerns), couldn't we run loop vectorizer before full unroll pass?

I thought that the loopunroller that is still running after loop vectorisation, was doing the same (full) unrolling as the one I have removed here. That's apparently not the case, and it is on my list to check things here.
But moving the loop vectoriser before the full unroll pass is what I wanted to achieve, yes, so I would be interested in this approach if feasible.

lebedev.ri added a comment.EditedMay 19 2021, 2:35 AM

I don't have suggestion on how to move forward *this* patch, because I think it's fundamentally the wrong direction to take.

Same here.

@nikic : Thanks, this is one reason why I wanted early feedback on this.

Like I wrote in the description of this patch, I believe unrolling before vectorisation is fundamentally the wrong approach. This seems to be supported by current numbers, but I appreciate we are talking about a handful of benchmarks. This, I think, also relies on the loop vectoriser which seems more powerful than SLP vectorisation currently. And like I wrote in a previous comment, now I am very much interested in counter examples.

Instead of removing early full unroll pass (see @nikic's concerns), couldn't we run loop vectorizer before full unroll pass?

I thought that the loopunroller that is still running after loop vectorisation, was doing the same (full) unrolling as the one I have removed here. That's apparently not the case, and it is on my list to check things here.
But moving the loop vectoriser before the full unroll pass is what I wanted to achieve, yes, so I would be interested in this approach if feasible.

The current pipeline essentially consists of two parts - inliner pipeline,
which mostly simplifies things so we can best guess the inlining decisions,
and optimization pipeline. So i don't believe we can just run LV in the other pipeline.

This, I think, also relies on the loop vectoriser which seems more powerful than SLP vectorisation currently.

Right.
Also, SLP does not have the ability to add runtime checks*. So we blindly fully unroll loops, which are otherwise vectorizable with Loop Vectorizer with RT checks, and then we miss many vectorization opportunities.

I really would like to see loop vectorizer before SLP.

  • See PR40976
fhahn added a comment.May 19 2021, 2:46 AM

I believe unrolling before vectorisation is fundamentally the wrong approach.

It is indeed suboptimal for a subset of loops which can be vectorized by LV and the SLP.

But in a lot other cases, early unrolling enables other passes to perform many additional simplifications as @nikic mentioned and I think it is very easy to come up with examples to show that, because a lot of simplification passes don't work well on loops. Just one example below. With early unrolling, LLVM will eliminate the memset, without early unrolling it won't. I would expect simplifications due to early unrolling to be quite helpful for a lot of general non-benchmark code with few vectorizable loops.

void foo(char *Ptr) {
    memset(Ptr, 0, 16);

    for (unsigned i = 0; i < 16; i++ )
      Ptr[i] = i+1;
}

This, I think, also relies on the loop vectoriser which seems more powerful than SLP vectorisation currently

One major difference is that LV can use runtime checks to ensure memory access do not alias. I suspect that's the main issue blocking SLP in the case in the description.

SjoerdMeijer added a comment.EditedMay 19 2021, 3:38 AM

I believe unrolling before vectorisation is fundamentally the wrong approach.

It is indeed suboptimal for a subset of loops which can be vectorized by LV and the SLP.

But in a lot other cases, early unrolling enables other passes to perform many additional simplifications as @nikic mentioned and I think it is very easy to come up with examples to show that, because a lot of simplification passes don't work well on loops. Just one example below. With early unrolling, LLVM will eliminate the memset, without early unrolling it won't. I would expect simplifications due to early unrolling to be quite helpful for a lot of general non-benchmark code with few vectorizable loops.

void foo(char *Ptr) {
    memset(Ptr, 0, 16);

    for (unsigned i = 0; i < 16; i++ )
      Ptr[i] = i+1;
}

I do see your point, but the funny thing is that this gives exactly the same codegen for this example (just a load and a store).

With this patch:

  • The loop gets vectorised 16 wide (i.e., it has one iteration)
  • Then the unroller comes along, completely unrolls this, so we get rid of the loop,
  • The memset is expanded in instruction selection, and then things get combined away to a load/store.

Before:

  • The loop gets fully unrolled early, so we have the memset and a block with the loop fully unrolled,
  • GlobalOptPass comes along: and sees the memset is globally dead and removes it: what remains is just the unrolled block.
  • The SLP vectoriser kicks in and vectorises this block.

All with the same result. So in a way this is an advertisement for skipping the fully unroller early. But like I said, I understand the point, and it was not my intention to skip fully unrolling, I just wanted it after the loop vectoriser.
Also, I was expecting that if was a terrible idea, I would have expected this to be flagged up by SPEC as it contains some different codes; but fair enough, I have run only SPEC and the embedded benchmarks.

This, I think, also relies on the loop vectoriser which seems more powerful than SLP vectorisation currently

One major difference is that LV can use runtime checks to ensure memory access do not alias. I suspect that's the main issue blocking SLP in the case in the description.

Yep, you're exactly right here. But this is quite important here that makes all the difference.

fhahn added a comment.May 19 2021, 4:22 AM

All with the same result. So in a way this is an advertisement for skipping the fully unroller early. But like I said, I understand the point, and it was not my intention to skip fully unrolling, I just wanted it after the loop vectoriser.
Also, I was expecting that if was a terrible idea, I would have expected this to be flagged up by SPEC as it contains some different codes; but fair enough, I have run only SPEC and the embedded benchmarks.

Fair enough, this is one of the simple cases where the backend picks up the slack from the middle-end (as @nikic mentioned), but I think we should focus on the IR we hand off to the backend, because the backend won't be able to optimize slightly more complex variations.

With a few small tweaks to the example, the backend is not able to pick up the slack (at least AArch64):

#include <string.h>

void use(char *);

void foo(int x) {
  char Ptr[16];
  memset(Ptr, 0, 16);
  if (x == 20)
    Ptr[5] = 10;
  for (unsigned i = 0; i < 16; i++ )
    Ptr[i] = i+1;
  use(&Ptr[0]);
}

Another different example that should also generate worse assembly:

#include <string.h>

void f(char*);
void bar();

void test(char *Ptr, int x) {
  char Foo[16];
  memset(Foo, 0, 16);
  for (unsigned i = 0; i < 16; i++ ) {
    Foo[i] = i+1;
    bar();
  }

  for (unsigned i = 0; i < 16; i++ ) {
    Ptr[i] = Foo[i] + 2;
  }
}

Those are just a few variations focused on DSE. I'd expect that similar issues exist for other passes, like GVN, InstCombine & co.

Thanks for those examples. I am going to play a little bit more with this.

In terms of a way forward (and getting the example properly vectorised):

  • I am going to try and move the vectoriser passes to just before the full unroller. I am somewhat sceptical, i.e. expecting a lot of fall out, but let's see.
  • I don't think I see any potential in tuning the full unroller and let it unroll less because it sounds fragile.
  • and I don't think there's an easy way forward with this in the SLP vectoriser.

So what remains is playing with moving the vectoriser earlier. But more/other ideas are more than welcome.

Not directly on topic for this review, but looking at some of the examples given, I'd point out that these are simply missed optimizations. If we're given vectorized IR and not performing other simple optimization (loop idiom, dce, etc..), those are optimization bugs we should probably fix.

It also sounds like it might be worth having the vectorizer recognize the case where it can vectorize all iterations in a single vector iteration and break the backedge if so. This would avoid the need for another pass of loop deletion or unrolling before scalar opts could easily kick in.

If we improved the robustness of our other optimizations w.r.t. vector instructions, we'd probably be a lot less sensitive to the proposed pass reordering.

xbolva00 added a subscriber: dtemirbulatov.EditedMay 19 2021, 11:30 AM

If we improved the robustness of our other optimizations w.r.t. vector instructions, we'd probably be a lot less sensitive to the proposed pass reordering.

true, but the main problem here is SLP vectorizer & missed vectorizations.

If https://reviews.llvm.org/D28907 is finished and commited & SLP gains ability to emit RT checks, I think these both improvements in SLP could fix quite a lot missed vectorization opportunities.

cc @dtemirbulatov

Not directly on topic for this review, but looking at some of the examples given, I'd point out that these are simply missed optimizations. If we're given vectorized IR and not performing other simple optimization (loop idiom, dce, etc..), those are optimization bugs we should probably fix.

It also sounds like it might be worth having the vectorizer recognize the case where it can vectorize all iterations in a single vector iteration and break the backedge if so. This would avoid the need for another pass of loop deletion or unrolling before scalar opts could easily kick in.

If we improved the robustness of our other optimizations w.r.t. vector instructions, we'd probably be a lot less sensitive to the proposed pass reordering.

Thanks @reames , this is an interesting view, which I agree with.

In terms of how to progress this best, this is probably the path of least resistance as it involves fixing up one pass as opposed to a pass reordering + fixing several:

true, but the main problem here is SLP vectorizer & missed vectorizations.

If https://reviews.llvm.org/D28907 is finished and commited & SLP gains ability to emit RT checks, I think these both improvements in SLP could fix quite a lot missed vectorization opportunities.

So I think I am going to look into this first.

fhahn added a comment.May 20 2021, 2:52 AM

Not directly on topic for this review, but looking at some of the examples given, I'd point out that these are simply missed optimizations. If we're given vectorized IR and not performing other simple optimization (loop idiom, dce, etc..), those are optimization bugs we should probably fix.

It also sounds like it might be worth having the vectorizer recognize the case where it can vectorize all iterations in a single vector iteration and break the backedge if so. This would avoid the need for another pass of loop deletion or unrolling before scalar opts could easily kick in.

If we improved the robustness of our other optimizations w.r.t. vector instructions, we'd probably be a lot less sensitive to the proposed pass reordering.

Thanks @reames , this is an interesting view, which I agree with.

In terms of how to progress this best, this is probably the path of least resistance as it involves fixing up one pass as opposed to a pass reordering + fixing several:

I think the main problem in the examples I shared is that we don't run various passes like DSE after late unrolling. Independent of any pass-ordering issues, it would be good for DSE to be able to analyze & detect loops that initialize/overwrite a memory range.

true, but the main problem here is SLP vectorizer & missed vectorizations.

If https://reviews.llvm.org/D28907 is finished and commited & SLP gains ability to emit RT checks, I think these both improvements in SLP could fix quite a lot missed vectorization opportunities.

So I think I am going to look into this first.

I put up a sketch to get a discussion started on how to best support memory runtime check generation for SLP: D102834

Matt added a subscriber: Matt.May 20 2021, 8:00 AM

And there is something weird in terms of cost modelling between vectorizers.

See https://bugs.llvm.org/show_bug.cgi?id=47491

If we dont unroll the loop, llvm loop vectorizer vectorizes the loop (cost model says it is profitable).

If we unroll the loop, slp does not vectorize the unrolled loop.

And there is something weird in terms of cost modelling between vectorizers.

See https://bugs.llvm.org/show_bug.cgi?id=47491

If we dont unroll the loop, llvm loop vectorizer vectorizes the loop (cost model says it is profitable).

If we unroll the loop, slp does not vectorize the unrolled loop.

And other one: https://bugs.llvm.org/show_bug.cgi?id=50552 ... SLP just misses currently these patterns.

Thanks for raising that one. It feels like the number of missed opportunities is endless. From the discussion here however, my impression was that there's very little appetite to skip the full unroller at this point and that running the LV before the full unroller would involve quite some surgery in the optimisation pipeline. My take away was that trying to improve the SLP was probably more low hanging fruit. Thus, I was thinking about abandoning this change....

And there is something weird in terms of cost modelling between vectorizers.

See https://bugs.llvm.org/show_bug.cgi?id=47491

If we dont unroll the loop, llvm loop vectorizer vectorizes the loop (cost model says it is profitable).

If we unroll the loop, slp does not vectorize the unrolled loop.

And other one: https://bugs.llvm.org/show_bug.cgi?id=50552 ... SLP just misses currently these patterns.

But yeah, can be fixed with @fhahn's D102834