This is an archive of the discontinued LLVM Phabricator instance.

[LV] Don't widen trivial induction variables
ClosedPublic

Authored by mssimpso on Jun 22 2016, 1:32 PM.

Details

Summary

We currently always try to vectorize induction variables. However, if an induction variable is only used for counting loop iterations or computing addresses with getelementptr instructions, this doesn't really make sense. We only see a benefit from vectorizing induction variables if they are used in non-trivial ways. For example, here the induction variable is stored to memory:

for (int i = 0; i < n; ++i)
  a[i] = i;

Needlessly vectorizing causes us to generate unnecessary phi nodes, extracts, and other computation inside the loop. InstCombine can sometimes clean up the code we generate, but in general it cannot do so completely, especially when the unroll factor is greater than one (we create dependent step vectors that are difficult to simplify). It would be better if we didn't generate poor code to begin with.

This patch checks that induction variables are used in non-trivial ways before deciding to widen them. If an induction variable is only used for counting iterations or computing addresses, we scalarize it instead.

This change results in a static reduction in the number of instructions in nearly every benchmark in spec2000 and spec2006. In addition, we've observed significant performance improvements (> 1%) in several of them with no non-noise reductions. Experiments were conducted on Kryo.

Please take a look.

Diff Detail

Repository
rL LLVM

Event Timeline

mssimpso updated this revision to Diff 61599.Jun 22 2016, 1:32 PM
mssimpso retitled this revision from to [LV] Don't widen trivial induction variables.
mssimpso updated this object.
mssimpso added reviewers: mkuper, sbaranga, jmolloy, nadav.
mssimpso added subscribers: llvm-commits, mcrosier.
mkuper edited edge metadata.Jun 22 2016, 1:56 PM

Thanks a lot for cleaning up my mess, the original patch was probably rather overzealous. :-\
The cases I've looked at were nicely cleaned up by InstCombine, but I guess having it work in the general case was too much to expect.

lib/Transforms/Vectorize/LoopVectorize.cpp
2189 ↗(On Diff #61599)

This looks a bit weird to me.Do you have a test-case where this works, but instcombine fails to simplify the regular step vector?
Perhaps it would be better to fix instcombine, I can look into it.

In any case, that shouldn't block the rest of the patch - (a) not widening the phi when the uses are trivial, and (b) using the scalarized step vector instead of the regular one should be independently good, right?

4145 ↗(On Diff #61599)

We are already doing something very similar in collectValuesToIgnore(), for basically the same reason - to estimate whether we'll end up needing a vector IV or not (and when InstCombine can't do the cleanup, my patch breaks that logic as well).

Can you use collectValuesToIgnore() instead? Either by checking whether all the phi users are in ValuesToIgnore, or by marking the PHI itself as "should/should not be widened" during collectValuesToIgnore(), if that makes sense?

Michael,

Thanks for the quick feedback. But I definitely wasn't trying to clean up after you! Your work creating vector phi's is certainly a step in the right direction. I just came across some examples where we were currently generating bad code (even before your changes). Please see my responses inline. Thanks!

Matt.

lib/Transforms/Vectorize/LoopVectorize.cpp
2189 ↗(On Diff #61599)

This looks a bit weird to me.Do you have a test-case where this works, but instcombine fails to simplify the regular step vector?

Sure. Here's a simple example with interleaved access vectorization enabled.

In the test case below, if we use the regular step vector, we end up creating unneeded splat inserts and extracts inside the loop. A scalar induction variable is definitely preferable here since we don't do anything exciting with it.

; opt -S < %s -loop-vectorize -force-vector-width=2 -force-vector-interleave=2 -enable-interleaved-mem-accesses -instcombine
target datalayout = "e-m:e-i64:64-i128:128-n32:64-S128"
target triple = "aarch64--linux-gnu"

%pair = type { i64, i64 }
define void @interleaved(%pair *%p, i64 %y, i64 %n) {
entry:
  br label %for.body

for.body:
  %i  = phi i64 [ %i.next, %for.body ], [ 0, %entry ]
  %f1 = getelementptr inbounds %pair, %pair* %p, i64 %i, i32 1
  %r0 = load i64, i64* %f1, align 8
  %r3 = xor i64 %r0, %y
  store i64 %r3, i64* %f1, align 8
  %i.next = add nuw nsw i64 %i, 1
  %cond = icmp slt i64 %i.next, %n
  br i1 %cond, label %for.body, label %for.end

for.end:
  ret void
}

I've seen other cases with what I've been working on that lead me to believe fixing this kind of thing in InstCombine is not going to be that fruitful. Things get very ugly when interleaved access vectorization is enabled with conditional stores vectorization, for example. I think it's better to do the right thing at the outset.

4145 ↗(On Diff #61599)

Ah, right. I forgot we already checked for this in collectValuesToIgnore. Thanks for pointing this out! I will update the patch.

mkuper added inline comments.Jun 22 2016, 3:06 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
2189 ↗(On Diff #61599)

Sorry, I wasn't very clear.

I didn't mean an example of when a scalar IV is preferable. You're completely right about that.
I meant a case where, in addition to the scalar IV, we want the pre-scalarized step vector, because the vectorized one doesn't simplify. Or is this the same example?

mssimpso added inline comments.Jun 22 2016, 6:17 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
2189 ↗(On Diff #61599)

No, you were clear. This is the same example -- it doesn't simplify. Without the pre-scalarized step vector, we're left with the splats, extracts, etc. I will add it as a test case to the patch.

anemet added a subscriber: anemet.Jun 22 2016, 8:37 PM

Hi Matt,

This seems to be related to https://llvm.org/bugs/show_bug.cgi?id=27881#c6

I am also wondering if this is general goodness that should be in instcombine rather than adding more special-casing to the already complex vectorizer.

Adam

mkuper added inline comments.Jun 22 2016, 11:59 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
2189 ↗(On Diff #61599)

Ah, ok, I was confused because you wrote that "a scalar induction variable is definitely preferable".
Anyway, this does look like an InstCombine issue.
We get:

%broadcast.splatinsert = insertelement <2 x i64> undef, i64 %index, i32 0
%broadcast.splat = shufflevector <2 x i64> %broadcast.splatinsert, <2 x i64> undef, <2 x i32> zeroinitializer
%induction = add <2 x i64> %broadcast.splat, <i64 0, i64 1>
%induction1 = add <2 x i64> %broadcast.splat, <i64 2, i64 3>
%3 = extractelement <2 x i64> %induction, i32 0
%4 = getelementptr inbounds %pair, %pair* %p, i64 %3, i32 1
%5 = insertelement <2 x i64*> undef, i64* %4, i32 0
%6 = extractelement <2 x i64> %induction, i32 1
%7 = getelementptr inbounds %pair, %pair* %p, i64 %6, i32 1
%8 = insertelement <2 x i64*> %5, i64* %7, i32 1
%9 = extractelement <2 x i64> %induction1, i32 0
%10 = getelementptr inbounds %pair, %pair* %p, i64 %9, i32 1
%11 = insertelement <2 x i64*> undef, i64* %10, i32 0
%12 = extractelement <2 x i64> %induction1, i32 1
%13 = getelementptr inbounds %pair, %pair* %p, i64 %12, i32 1
%14 = insertelement <2 x i64*> %11, i64* %13, i32 1

InstCombine cleans this up to:

%broadcast.splatinsert = insertelement <2 x i64> undef, i64 %index, i32 0
%broadcast.splat = shufflevector <2 x i64> %broadcast.splatinsert, <2 x i64> undef, <2 x i32> zeroinitializer
%induction1 = add <2 x i64> %broadcast.splat, <i64 2, i64 3>
%3 = getelementptr inbounds %pair, %pair* %p, i64 %index, i32 1
%4 = or i64 %index, 1
%5 = getelementptr inbounds %pair, %pair* %p, i64 %4, i32 1
%6 = extractelement <2 x i64> %induction1, i32 0
%7 = getelementptr inbounds %pair, %pair* %p, i64 %6, i32 1
%8 = extractelement <2 x i64> %induction1, i32 1
%9 = getelementptr inbounds %pair, %pair* %p, i64 %8, i32 1
%10 = bitcast i64* %3 to <4 x i64>*
%11 = bitcast i64* %7 to <4 x i64>*

I don't immediately see why it should be able to handle %induction, but not %induction1.
Am I missing something obvious?

I am also wondering if this is general goodness that should be in instcombine rather than adding more special-casing to the already complex vectorizer.

Adam,

I'm not sure I agree with this. InstCombine/InstructionSimplify are also very complex, and if you look at my reply to Michael, simplifying these cases is not straightforward. I don't think we want, or can, have InstCombine be an "un-vectorizer". I just doesn't make sense to vectorize these trivial induction variables in the first place.

lib/Transforms/Vectorize/LoopVectorize.cpp
2189 ↗(On Diff #61599)

InstCombine/InstructionSimplify is only able to simplify %induction because it knows element zero of the add is a noop:

%induction = add <2 x i64> %broadcast.splat, <i64 0, i64 1>

It can then replace the extract from element 0 with %index, and then the rest falls out. We can't do the same thing for %induction1, though. The add is not a noop. InstCombine would have to decide it would be beneficial to unvectorize actual computation, replacing a single vector add with 2 scalar adds. It doesn't seem like InstCombine is equipped to do this. And I think you would need a cost model for this, anyway.

mssimpso added inline comments.Jun 23 2016, 6:13 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
2189 ↗(On Diff #61599)

I also should have mentioned that if you increase the vector factor to 4, %induction is not longer removed as well, for the reasons I mentioned above. The noop trick is only useful when there are two elements in the vector.

mssimpso updated this revision to Diff 61696.Jun 23 2016, 10:08 AM
mssimpso edited edge metadata.

Addressed Michael's initial comments.

  • Reused collectValuesToIgnore for determining if an induction variable has any non-trivial users. We were already doing this in the cost model.
  • Added another test case for the interleaved example we've been discussing.

Thanks!

Basically this LGTM (modulo some nits, inline), as long as you also file a bug for the InstCombine issue.
I'm also ok with landing only the scalar IV part of this for now, and trying to make a decision about getScalarizedStepVector vs. IC separately. Adam, what do you think?

lib/Transforms/Vectorize/LoopVectorize.cpp
4223 ↗(On Diff #61696)

Nit - can you perhaps rename ValuesToIgnore? Because we're not using it here to "ignore" anything, and it looks odd.

6352 ↗(On Diff #61696)

Note that there's a conflicting modification to this code planned by Wei: http://reviews.llvm.org/D20474

test/Transforms/LoopVectorize/induction.ll
110 ↗(On Diff #61696)

simply -> simplify

mkuper added a subscriber: wmi.Jun 23 2016, 11:45 AM

Adding Wei due to the possible conflict with http://reviews.llvm.org/D20474

mssimpso updated this revision to Diff 61818.Jun 24 2016, 12:28 PM

Addressed Michael's comments.

I'll wait for additional feedback from Adam. Thanks, Michael!

anemet added inline comments.Jun 27 2016, 9:51 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
2191 ↗(On Diff #61818)

Matt, your argument about instcombine not being adequate for this is a convincing one.

On the other hand, I am still wondering if the solution is general enough. There should be cases where we'd be better off having *both* the vector and the scalar version of the same induction variable. I.e. depending on the use, you may want to use the vector version (e.g. store) or the scalar version (address calc, compares).

I am not saying that we need to necessarily implement this but the current solution should be taking us incrementally closer to the full solution.

mssimpso added inline comments.Jun 28 2016, 8:36 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
2191 ↗(On Diff #61818)

Adam,

One way to handle the cases you mention would be to allow the vectorizer to generate both a scalar version and a vector version of the same induction variable. Then it could select whether to generate a use of the scalar or vector version based on the instruction (address calculation, store, etc.). DCE and Instcombine would presumably clean up afterwards.

The current patch does take us incrementally to that kind of solution. It handles the case where we know we will always prefer to use the scalar version. If that's the case, we generate the scalar steps; otherwise, we continue to generate the vector steps. It doesn't, however, generate the scalar version of the induction variable along side the vector version.

anemet added inline comments.Jun 28 2016, 10:09 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
2191 ↗(On Diff #61818)

One way to handle the cases you mention would be to allow the vectorizer to generate both a scalar version and a vector version of the same induction variable. Then it could select whether to generate a use of the scalar or vector version based on the instruction (address calculation, store, etc.). DCE and Instcombine would presumably clean up afterwards.

Yes that is the direction I was thinking of going in the long term. (If the later passes are ineffective cleaning up unused versions we can also generate them on demand.)

The current patch does take us incrementally to that kind of solution. It handles the case where we know we will always prefer to use the scalar version. If that's the case, we generate the scalar steps; otherwise, we continue to generate the vector steps. It doesn't, however, generate the scalar version of the induction variable along side the vector version.

Fair enough.

I have one last high-level question. Does this fix the pointer arithmetic code in https://llvm.org/bugs/show_bug.cgi?id=27881#c6 ?

I will look at the specifics of the patch later today.

mssimpso added inline comments.Jun 28 2016, 11:23 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
2191 ↗(On Diff #61818)

Does this fix the pointer arithmetic code in https://llvm.org/bugs/show_bug.cgi?id=27881#c6 ?

I think so. With this patch post-instcombine, we generate code like the following for the pointer arithmetic:

vector.body:
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %offset.idx = shl i64 %index, 3
  %3 = or i64 %offset.idx, 8
  %4 = or i64 %offset.idx, 16
  %5 = or i64 %offset.idx, 24
  %6 = getelementptr inbounds float, float* %a, i64 %offset.idx
  %7 = getelementptr inbounds float, float* %a, i64 %3
  %8 = getelementptr inbounds float, float* %a, i64 %4
  %9 = getelementptr inbounds float, float* %a, i64 %5

The bigger issue there seems to be the cost model though. The loop still builds up vectors with the loaded values that feed the vector fadd's.

Regarding the IR explosion mentioned in that bug pre-instcombine, this patch will not help. This is because I scalarize the arithmetic for the step vectors, but still insert the results into a vector, since the rest of the code expects the IV's to be vectors. (All the scalarization is handled in this way I believe). There's perhaps room for a compile-time optimization here where the vectorizer would avoid all the scalar-to-vector-to-scalar conversions.

I will look at the specifics of the patch later today.

Thanks, as always!

anemet added inline comments.Jun 28 2016, 11:33 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
2191 ↗(On Diff #61818)

I think so. With this patch post-instcombine, we generate code like the following for the pointer arithmetic:

Great, thanks!

anemet added inline comments.Jun 29 2016, 1:55 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
308–314 ↗(On Diff #61818)

Please comment the new argument

316 ↗(On Diff #61818)

Since you're holding on to this in a member, it's less surprising if the reference is live throughout the lifetime of the InnerLoopVectorizer. I think that we should pass this in the ctor. I see no particular reason not to do this way.

420 ↗(On Diff #61818)

\p Step, please explain at least \p Index as well.

426 ↗(On Diff #61818)

same

4219–4233 ↗(On Diff #61818)

Looks like this and the code below under trunc are quite similar. Can we factor this is out to a helper in a prequel to this patch and then add the special case in the to the helper?

test/Transforms/LoopVectorize/induction.ll
70–80 ↗(On Diff #61818)

I think that the testcase from the PR I mentioned would better highlight the problem of vectorizing the induction variable. I.e. in this case we don't actually end up with a vector induction variable but a secondary unnecessary IV.

mssimpso added inline comments.Jun 30 2016, 7:01 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
316 ↗(On Diff #61818)

Would it make sense to just pass the Cost Model in the ctor? MinimumBitWidths comes from there as well. Actually, I'm not sure why we don't pass the Legality in the ctor either.

4219–4233 ↗(On Diff #61818)

I think so - I'll give it a shot and post for review.

test/Transforms/LoopVectorize/induction.ll
70–80 ↗(On Diff #61818)

I'll add that test case as well.

anemet added inline comments.Jun 30 2016, 11:14 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
316 ↗(On Diff #61818)

Good point. If you prefer you can leave VecValuesToIgnore as it is for now and clean all this up in follow-on patches.

mssimpso added inline comments.Jun 30 2016, 12:23 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
4219–4233 ↗(On Diff #61818)

Refactoring done in D21903.

mssimpso updated this revision to Diff 62427.Jun 30 2016, 2:39 PM
mssimpso marked 4 inline comments as done.

Addressed Adam's comments.

  • Refactored the integer induction widening (D21903) and rebased.
  • Updated comments.
  • Added the test case from the PR mentioned in discussions.
lib/Transforms/Vectorize/LoopVectorize.cpp
316 ↗(On Diff #61818)

I think a follow-on is the way to go. I briefly looked at this, and there may be quit a bit to untangle. The cost model and legality have a lot of overlap, and we will want to be sure we don't unnecessarily duplicate anything.

anemet accepted this revision.Jul 1 2016, 10:52 AM
anemet edited edge metadata.

LGTM.

This revision is now accepted and ready to land.Jul 1 2016, 10:52 AM

I just started to look at D21903 but I am not sure I understand the order of the two patches. Ideally D21903 should have been a prequel to this but after reading the latest version of this patch I was expecting it to be the other way. But looking at D21903, that changes here are not present there. Can you please explain?

Thanks for the reviews Adam. D21903 is indeed the prequel refactoring patch, and I've rebased the current patch on top of that one. Lines 2250-2257 in the current patch are the primary functional change now. This check is now only in one place (in the new widenIntInduction function I added), unlike before the refactoring, where it was in two places (with the int induction case and the trunc case). Hope that helps!

Ah, great! Let me review in the right order than.

anemet added a comment.Jul 1 2016, 1:05 PM

This still LGTM after reading the patches in the right order.

Please don't forget to mention the PR in the commit log and update the PR as well. Thanks.

Will do. Thanks, Adam.

This revision was automatically updated to reflect the committed changes.