Page MenuHomePhabricator

[LV] Allow vectorization of loops with induction post-inc expressions
AbandonedPublic

Authored by kuhar on Sep 10 2015, 8:59 AM.

Details

Reviewers
None
Summary

This patch teaches the LoopVectorizer not to bailout on loops with induction post-inc expressions.
Example:

for (int i = x; i < n; ++i) *ptr++ = *p++;
puts(ptr); // outside use

Previously the vectorizer wasn't able to vectorize similar code.

Diff Detail

Repository
rL LLVM

Event Timeline

kuhar updated this revision to Diff 34450.Sep 10 2015, 8:59 AM
kuhar retitled this revision from to Allow vectorization of loops with induction post-inc expressions.
kuhar updated this object.
kuhar set the repository for this revision to rL LLVM.
kuhar added a subscriber: llvm-commits.
kuhar retitled this revision from Allow vectorization of loops with induction post-inc expressions to [LV] Allow vectorization of loops with induction post-inc expressions.Sep 10 2015, 9:08 AM

Hi Jakub,

How did you end up with such a loop? I see the issue when I run vectorizer on your IR tests, but I can't reproduce the issue with the source C code. I wonder if this patch actually just papers over a problem in another pass (IndVarSimplify?). Could you share more details on this please?

Thanks,
Michael

kuhar added a comment.Sep 11 2015, 2:20 AM

Hello Michael,

here is complete C code that is being vectorized now and wasn't previously. One could probably reduce it further, but this version preserves the main idea:

#include <stdlib.h>
 
#define BUFF_SIZE 4096

void __attribute__((noinline))
    foo(int y, char *restrict src, char *restrict dest) {
  char *p = src;
  char *ptr = dest;
  int n = (y - 1) < (dest + BUFF_SIZE - ptr - 1)
              ? (y - 1)
              : (dest + BUFF_SIZE - ptr - 1);
  for (int vv = 0; vv < n; ++vv)
    *ptr++ = *p++;

  *ptr++ = *p++;

  printf("%d\n", ptr);
}

int main() {
  char S[] = "01234567890abcdefghij123456789abcdefghij0123456789XXX";
  char D[128] = { 0 };
  foo(33, S, D);
  printf("%s\n", D);
}

I need this patch to perform some other optimizations (more advanced loop unrolling/vectorizing).

This patch triggers in real-life code. For example, it showed 24.54% improvement in lnt.MultiSource/Benchmarks/MiBench/automotive-susan/automotive-susan and 7.60% improvement in lnt.SingleSource/Benchmarks/BenchmarkGame/recursive on A57 in aarch64.

Hi Jakub,

I was able to reproduce the original issue with the test you posted, thanks! However, it still looks like the problem isn't in vectorizer. E.g. if you replace int n = y < BUFF_SIZE ? (y - 1) : (BUFF_SIZE - 1) (which is equivalent to the expression in your test) with something else, like int n = y or int n = y*7%19, the loop would be vectorized. Looking at dumps, I see that earlier passes (starting from MergedLoadStoreMotion) behave differently with these value of upper bound, which doesn't look right, and we should fix those passes, not vectorizer.

Thanks,
Michael

kuhar added a comment.Sep 14 2015, 4:18 AM

Hi Michael,

I followed your comments and did some investigation on the difference between the IR with int n = y < BUFF_SIZE ? (y - 1) : (BUFF_SIZE - 1) and with int n = (y*9)%17. As you mentioned, they start to differ after IndVarSimplify pass - with the second expression (y*9)%17 it replaces a phi node with gep generated by SCEVExpander. It's not expensive, because its essentially a reuse of previous value (%rem).

When it comes to the first code, it generates such BB when entering IndVarSimplify:

entry:
  %cmp = icmp slt i32 %y, 4096
  %sub = add nsw i32 %y, -1
  %cond = select i1 %cmp, i32 %sub, i32 4095
  %cmp1.4 = icmp sgt i32 %cond, 0
  br i1 %cmp1.4, label %for.body.lr.ph, label %for.end

...

for.end:                                          ; preds = %for.cond.for.end_crit_edge, %entry
  %ptr.0.lcssa = phi i8* [ %incdec.ptr2, %for.cond.for.end_crit_edge ], [ %dest, %entry ]

The problem is that the SCEV for backedge taken count looks like this: (1 + (zext i32 (-3 + (-1 * (-4097 smax (-1 + (-1 * %y))))<nsw>) to i64) + %dest). IndVarSimplify makes sure that it would be cheap to expand it with SCEVExpander and here the real problems start. SCEVExpander thinks that smax is a costly operation, so this whole expression is also considered costly.
I don't know how hard it'd be to make IndVarSimplify or SCEVExpander to look around the function to find an already generated equivalent expression, but I don't think it'd be trivial. IndVarSimplify certainly doesn't know anything about loop vectorization, so it's not able to determine that even though some SCEV is expensive to expand, it would be beneficial to do so because of speed-up caused by later vectorization.

I'm not yet convinced that it'd be better to follow this patch of fixing IndVarSimplify - performing this optimization in LoopVectorizer is quite easy and doesn't seem hacky to me.

Hi Jakub,

The problem with the current approach is that it only fixes one case, that is caused by a different problem (as you showed - the actual problem is in IndVarSimplify/SCEVExpander). We fix consequences, not the rootcause itself. Maybe next week someone will come up with a patch, that would fix the same issue in LoopUnroller, and next week - in LoopInterchange or any other loop transformation.

It might be not-trivial to fix it in SCEV/IndVarSimplify, but I think it should be possible, and that's what we need to try first. For this case we just need to teach SCEV/IndVarSimplify to take into account loop invariants. E.g. the entire expression (zext i32 (-3 + (-1 * (-4097 smax (-1 + (-1 * %y))))<nsw>) to i64) is a loop invariant, so it shouldn't have high cost.

The fix in vectorizer might be simple, but I believe it's a (small) step in a wrong direction.

Michael

kuhar added a comment.EditedSep 16 2015, 6:18 AM

Hi Michael,

I've spent the last two days trying to come up with a IndVarSimplify patch that would enable the unmodified LoopVectorizer to work on my code.
The problem is that currently IndVarSimplify::RewriteExitValues explicitly works only on expressions which are loop invariants and there is a cost check inside.
When I tired commenting out this HighCost check, benchmark scores were not great: the only benchmark significantly improved was automotive-susan (~ 21%), and there were many regressions, ex. (all the scores on A57/aarch64):
lnt.MultiSource/Benchmarks/ASC_Sequoia/AMGmk/AMGmk 27.03%
lnt.MultiSource/Benchmarks/sim/sim 8.96%
lnt.MultiSource/Benchmarks/Olden/bh/bh 6.13%
lnt.MultiSource/Benchmarks/BitBench/uudecode/uudecode 6.07%
lnt.SingleSource/Benchmarks/Stanford/Puzzle 5.76%
lnt.MultiSource/Benchmarks/Prolangs-C++/ocean/ocean 5.34%
lnt.MultiSource/Applications/lemon/lemon 4.84%
lnt.MultiSource/Benchmarks/TSVC/IndirectAddressing-dbl/IndirectAddressing-dbl 4.73%
lnt.MultiSource/Benchmarks/FreeBench/pcompress2/pcompress2 4.15%
There were also some severe regressions on our internal benchmarks.

I tried to come up with some heuristic approach like recursively counting the number of expensive operations coming SCEV expansion and comparing it with the number of instructions in the whole loop. After playing with it for some time I managed to get rid of these most serious regressions in lnt - the only (serious) ones left were:
lnt.MultiSource/Benchmarks/BitBench/uudecode/uudecode 6.14%
lnt.MultiSource/Benchmarks/Olden/bh/bh 5.39%
lnt.SingleSource/Benchmarks/BenchmarkGame/recursive 3.44%

The problem was that there were also not as many (little) improvements and that there were still some serious regressions in out benchmarks. Another thing is that even in the most aggressive configuration (a.k.a. always assume cheap expansion) I was not able to reproduce improvements coming form my original LoopVectorizer patch. I suspect that the difference comes from the fact that IndVarSimplify actually creates some new instructions, and my changes in the LoopVectorizer only reuse some existing values.

I think that the problem is just that in IndVarSimplify it's to early to determine if it's beneficial to rewrite some exit values and that it leads to new instructions being emitted - it's so easy to regress some stuff. Maybe some other optimization pass (like recently discussed LEV) could do some cleanup after loop vectorization, I'm not sure... Anyway, I think that the biggest argument for patching the LoopVectorizer is that it can regresses code very little (from what I've seen running multiple benchmarks).
Do you have some other ideas on performing the changes in IndVarSimplify, Michael?

Cheers,
Jakub

Hi Jakub,

I also spent some time in debugger today, and now I'm convinced even more that we should try to handle it in IndVarSimplify/SCEV. As you correctly noted, the problem is that IndVarSimplify thinks that smax expression is expensive to expand. While it's indeed expensive to expand smax/smin, we don't need to do that in our case, as this particular expansion already exists in our code - variable %n holds this value. Moreover, SCEV already tries to find existing expansion (see SCEVExpander::findExistingExpansion), but fails in our case. The issue we should be solving here is how to make SCEVExpander not fail.

Now, how does it try to find existing expansion, and why does it fail? The logic is pretty simple here: we examine all checks in exiting blocks of our loop - LHS and RHS of these comparisons already have existing expansions by definition (they are the expansions). If their SCEV expression happens to be exactly the same as the one we're looking for (S) - we're done! But here is where our problem resides. In our case we are looking for the following expression:

(lldb) p S->dump()
(-3 + (-1 * (-4097 smax (-1 + (-1 * %y))))<nsw>)

But the most similar one that we have is this one:

(lldb) p SE.getSCEV(RHS)->dump()
(-2 + (-1 * (-4097 smax (-1 + (-1 * %y))))<nsw>)

So, we fail to see that we have almost the same existing one, so we think that we need to expand it from scratch - and that's expensive, so we bail out. Please note though that if we for some reason rewrote this expression as (1+ -3 + (-1 * (-4097 smax (-1 + (-1 * %y))))<nsw>), the current logic would catch it.

With that in mind, I tried the following patch:

diff --git a/lib/Analysis/ScalarEvolutionExpander.cpp b/lib/Analysis/ScalarEvolutionExpander.cpp
index ed7386b..e4af9dc 100644
--- a/lib/Analysis/ScalarEvolutionExpander.cpp
+++ b/lib/Analysis/ScalarEvolutionExpander.cpp
@@ -1829,10 +1829,16 @@ Value *SCEVExpander::findExistingExpansion(const SCEV *S,
                     TrueBB, FalseBB)))
       continue;

-    if (SE.getSCEV(LHS) == S && SE.DT.dominates(LHS, At))
+    const SCEV *LHSSE = SE.getSCEV(LHS);
+    if (LHSSE->getType() == S->getType() &&
+        isa<SCEVConstant>(SE.getMinusSCEV(LHSSE, S)) &&
+        SE.DT.dominates(LHS, At))
       return LHS;

-    if (SE.getSCEV(RHS) == S && SE.DT.dominates(RHS, At))
+    const SCEV *RHSSE = SE.getSCEV(RHS);
+    if (RHSSE->getType() == S->getType() &&
+        isa<SCEVConstant>(SE.getMinusSCEV(RHSSE, S)) &&
+        SE.DT.dominates(RHS, At))
       return RHS;
   }

The idea is that when an expression only differs by a constant from an existing one, we can say that it also exists (to be precise, there is a cheap way to expand it). However, I think it might be incorrect in some corner cases, so I'd like to check it with SCEV experts first. But anyway, even if it's not correct in this particular form, I think it should be possible to make it correct if we add some sanity checks, and I believe that's the way to go.

Thanks,
Michael

PS: I haven't done any testing of this patch except on the provided test case, which was successfully vectorized.

The idea is that when an expression only differs by a constant from an existing one, we can say that it also exists (to be precise, there is a cheap way to expand it). However, I think it might be incorrect in some corner cases, so I'd like to check it with SCEV experts first. But anyway, even if it's not correct in this particular form, I think it should be possible to make it correct if we add some sanity checks, and I believe that's the way to go.

I agree that this is a good thing to do. Can you post the patch for this? I'd like to get Sanjoy to look at this.

That having been said, do you have a theoretical argument that IndVarSimplify can always fix this?

Hi Hal,

I agree that this is a good thing to do. Can you post the patch for this? I'd like to get Sanjoy to look at this.

Actually, Sanjoy has already commented on the patch I posted above on the mailing list, but it didn't get to phabricator for some reason. I can create a separate phab-review for this patch, but I'd like to run some testing before that, which I hope I can do next week.

That having been said, do you have a theoretical argument that IndVarSimplify can always fix this?

Nope, I can't prove that. However, if this will fix all the cases we care about, then I'd rather not add new entities to vectorizer - it's already complicated enough.

Thanks,
Michael

Hi Hal,

I agree that this is a good thing to do. Can you post the patch for this? I'd like to get Sanjoy to look at this.

Actually, Sanjoy has already commented on the patch I posted above on the mailing list, but it didn't get to phabricator for some reason. I can create a separate phab-review for this patch, but I'd like to run some testing before that, which I hope I can do next week.

Great, thanks!

That having been said, do you have a theoretical argument that IndVarSimplify can always fix this?

Nope, I can't prove that. However, if this will fix all the cases we care about, then I'd rather not add new entities to vectorizer - it's already complicated enough.

I doubt it is complicated enough ;) -- but that does not necessarily mean it needs this enhancement.

Thanks,
Michael

Hi everyone,

Just ot give an update on this: I tried the patch I posted before, but I'm not yet satisfied with it. The problem seems to be that we're detecting similar expressions now, but we're not reusing them. For the record, here is the test I'm playing with:

target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"

; Function Attrs: noinline nounwind ssp uwtable
define i8* @foo(i32 %y, i8* noalias %src, i8* noalias %dst) {
entry:
  %cmp = icmp slt i32 %y, 4096
  %sub = add nsw i32 %y, -1
  %tripcount = select i1 %cmp, i32 %sub, i32 4095
  %loop.entry.cond = icmp sgt i32 %tripcount, 0
  br i1 %loop.entry.cond, label %loop.ph, label %loop.exit

loop.ph:                                   ; preds = %entry
  br label %loop.body

loop.body:                                         ; preds = %loop.body, %loop.ph
  %iv = phi i32 [ 0, %loop.ph ], [ %iv.next, %loop.body ]
  %src.iv = phi i8* [ %src, %loop.ph ], [ %src.iv.next, %loop.body ]
  %dst.iv = phi i8* [ %dst, %loop.ph ], [ %dst.iv.next, %loop.body ]

  %tmp = load i8, i8* %src.iv, align 1
  store i8 %tmp, i8* %dst.iv, align 1

  %src.iv.next = getelementptr inbounds i8, i8* %src.iv, i64 1
  %dst.iv.next = getelementptr inbounds i8, i8* %dst.iv, i64 1
  %iv.next = add nsw i32 %iv, 1

  %loop.cond = icmp slt i32 %iv.next, %tripcount
  br i1 %loop.cond, label %loop.body, label %loop.exit

loop.exit:                                 ; preds = %loop.exit, %entry
  %src.iv.lcssa = phi i8* [ %src.iv.next, %loop.body ], [ %src, %entry ]
  ret i8* undef
}

!llvm.module.flags = !{!0, !1}
!llvm.ident = !{!2}

!0 = !{i32 2, !"Debug Info Version", i32 3}
!1 = !{i32 1, !"PIC Level", i32 2}
!2 = !{!"clang version 3.8.0 (trunk 247767) (llvm/trunk 247769)"}

What is expected from indvars on this test is to rewrite %src.iv.lcssa with a loop-invariant value. The patch does that, but it fills the loop preheader with code similar to what we have in entry: basic block, which seems unnecessary. My plan is to investigate it further, and hopefully fix it.

Michael

Hi Jakub and others,

As I mentioned before, my previous patch helped vectorizing the original test, but in the process we duplicated some code into the loop preheader. That happened because I fixed findExistingValues function to look for similar expression too, but I didn't fix SCEVExpander::expand to behave correspondingly. That's a bit bigger work to do than I thought before, so I have to postpone it until I have more time. I'm still sure that it's a right approach and eventually I hope to do it if no one else does that before. Jakub, do you have any plans on trying to implement this?

I filed a PR for this: https://llvm.org/bugs/show_bug.cgi?id=24920

Thanks,
Michael

kuhar added a comment.Sep 24 2015, 8:42 AM

Hello,

Jakub, do you have any plans on trying to implement this?

If no one picks it up, then I'd like do it, but it won't be earlier than in 1 or 2 weeks from now - tomorrow is the last day of my internship and I'm returning to my university next week.

kuhar abandoned this revision.Jul 9 2017, 10:19 PM