This is an archive of the discontinued LLVM Phabricator instance.

Differential D69563

[LV] Strip wrap flags from vectorized reductions
ClosedPublic

Authored by dantrushin on Oct 29 2019, 6:50 AM.

Details

Reviewers

fhahn
hsaito
hfinkel
Ayal
lebedev.ri

Commits

rGe498be573871: [LV] Strip wrap flags from vectorized reductions

Summary

Wrap flags set on scalar reductions may become invalid after vectorization.
After vectorization with interleaving, a vectorized reduction might look
like this:

%vec.phi  = phi <4 x i32> [ <i32 -104, i32 0, i32 0, i32 0>, %vector.ph ], [ %2, %vector.body ]
%vec.phi2 = phi <4 x i32> [ zeroinitializer, %vector.ph ], [ %3, %vector.body ]
%2 = sub nuw nsw <4 x i32> %vec.phi, %broadcast.splat
%3 = sub nuw nsw <4 x i32> %vec.phi2, %broadcast.splat

Note that the nowrap flags are invalid (0 - x wraps) and must be reset even
though they were correct in the scalar case. Otherwise, InstSimplify will
fold (0-X)<nuw> to 0, treating the subtraction as a no-op.

Fixes PR43828
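
The effect described in the summary can be reproduced with a small stand-alone model. The sketch below is not LLVM code: `ReductionResult`, `subWrapsUnsigned`, `scalarReduction` and `interleavedReduction` are hypothetical names, and the fixed VF of 4 with interleave count 2 is an assumption taken from the IR above. It counts how many individual subtractions would violate `nuw` in the scalar loop and in the interleaved vector loop.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Result of a modeled reduction: the final value and how many of the
// individual subtractions wrapped in unsigned arithmetic.
struct ReductionResult {
  uint32_t value;
  unsigned wraps;
};

// True when a - b wraps as an unsigned 32-bit subtraction, i.e. when a
// `sub nuw` with these operands would produce poison.
bool subWrapsUnsigned(uint32_t a, uint32_t b) { return b > a; }

// Scalar loop: acc = acc - x, repeated `iterations` times.
ReductionResult scalarReduction(uint32_t start, uint32_t x,
                                unsigned iterations) {
  ReductionResult r{start, 0};
  for (unsigned i = 0; i < iterations; ++i) {
    r.wraps += subWrapsUnsigned(r.value, x);
    r.value -= x;
  }
  return r;
}

// Vector loop with VF = 4 and interleave count 2 (assumed): lane 0 of the
// first part holds the start value, all other lanes start at the identity 0.
// Assumes `iterations` is a multiple of 8 (no remainder loop is modeled).
ReductionResult interleavedReduction(uint32_t start, uint32_t x,
                                     unsigned iterations) {
  constexpr unsigned Lanes = 4 * 2;
  assert(iterations % Lanes == 0 && "remainder loop not modeled");
  std::array<uint32_t, Lanes> acc{}; // zero-initialized partial results
  acc[0] = start;
  unsigned wraps = 0;
  for (unsigned i = 0; i < iterations; i += Lanes)
    for (uint32_t &lane : acc) {
      wraps += subWrapsUnsigned(lane, x);
      lane -= x;
    }
  // Middle block: fold the partial results with plain wrapping adds.
  uint32_t total = 0;
  for (uint32_t lane : acc)
    total += lane;
  return {total, wraps};
}
```

With a start value of -104 (reinterpreted as unsigned), x = 1 and 8 iterations, the scalar model performs no wrapping subtraction, while the interleaved model wraps in the seven lanes that start at 0. Both models still agree on the final value, which is why the vectorized code is correct only once the flags are dropped.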

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

dantrushin created this revision. Oct 29 2019, 6:50 AM

Herald added a project: Restricted Project. Oct 29 2019, 6:50 AM

Herald added subscribers: jfb, rkruppe, hiraditya.

Harbormaster completed remote builds in B40190: Diff 226892. Oct 29 2019, 6:52 AM

I remember seeing a similar patch for SLPVectorizer(?).
There, the consensus was that flags that are present on all instructions can be preserved.
Isn't that applicable here?

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3730–3731	`if (auto *I = dyn_cast<Instruction>(LoopVal)) I->dropPoisonGeneratingFlags();` Shouldn't this be done outside of this loop, though?

In D69563#1725269, @lebedev.ri wrote:

I remember seeing a similar patch for SLPVectorizer(?).
There, the consensus was that flags that are present on all instructions can be preserved.
Isn't that applicable here?

Reductions are a special case, IMHO. As you can see, interleaving creates a new vector IV {0,-,X}, which obviously wraps,
even though the original scalar {-104,-,x} did not.

So I believe we have a bug here.
I need a fix for it, so I had to create this review to get the wheels rolling, because other means (asking on llvm-dev and filing a PR) did not work ;-P

dantrushin marked an inline comment as done. Oct 29 2019, 7:19 AM

dantrushin added inline comments. Oct 29 2019, 7:19 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3730–3731	No, I think I want them cleared from every interleaved part

lebedev.ri added inline comments. Nov 4 2019, 12:41 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3730–3731	Still, `if (StripWrapFlags) cast<Instruction>(Val)->dropPoisonGeneratingFlags();` Isn't the actual bug in `getOrCreateVectorValue()`? Why does it return such erroneous instructions?

dantrushin marked an inline comment as done. Nov 5 2019, 1:49 AM

dantrushin added inline comments.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3730–3731	But these flags are valid for Val: it is the (original) scalar instruction, which is still present in the original scalar loop, which serves as the remainder loop after vectorization. I want to preserve flags wherever they are valid; otherwise I would simply not preserve them in InnerLoopVectorizer::widenInstruction:
  Value *V = Builder.CreateNAryOp(I.getOpcode(), Ops);

  if (auto *VecOp = dyn_cast<Instruction>(V))
    VecOp->copyIRFlags(&I);
`getOrCreateVectorValue()` simply returns the cached vector value from `VectorLoopValueMap`. This instruction was created by `InnerLoopVectorizer::widenInstruction()`, which lacks the necessary context. So we are basically back to the original question: if we do not want to preserve valid flags at all, we can simply not copy them in `widenInstruction()` (`VecOp->copyIRFlags(&I, false)`). But if we do want to preserve flags where they are valid, we need context, and must clear them only for reductions.

Are we essentially saying that any reassociation can't preserve NSW/NUW flags?

Say, X = MAX_INT, Y = -1, and Z = 1. "t1 = X + Y; t2 = t1 + Z" does not cause signed wrap. "t1 = X + Z; t2 = t1 + Y" does, right?

If we agree on that, then since vector reduction is a form of reassociation transformation, we need to drop NSW/NUW flags. We need to look at all other reassociation as well. Are we heading in that direction?
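
The X/Y/Z example above can be checked mechanically. The following is a minimal sketch in plain C++, not LLVM code; `addWrapsSigned`, `wrappingAdd` and `chainWraps` are hypothetical helper names. It evaluates both association orders on 32-bit integers and reports whether any intermediate `add` would violate `nsw`.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// True when a + b overflows a 32-bit signed integer, i.e. when an `add nsw`
// with these operands would produce poison. Computed in 64 bits so the check
// itself has no undefined behavior.
bool addWrapsSigned(int32_t a, int32_t b) {
  int64_t wide = int64_t(a) + int64_t(b);
  return wide > std::numeric_limits<int32_t>::max() ||
         wide < std::numeric_limits<int32_t>::min();
}

// Two's-complement wrapping add, modeling an LLVM `add` without flags.
// The unsigned-to-signed conversion is modular on mainstream compilers
// (and guaranteed to be since C++20).
int32_t wrappingAdd(int32_t a, int32_t b) {
  return int32_t(uint32_t(a) + uint32_t(b));
}

// True when evaluating (a + b) + c wraps in either of its two additions.
bool chainWraps(int32_t a, int32_t b, int32_t c) {
  return addWrapsSigned(a, b) || addWrapsSigned(wrappingAdd(a, b), c);
}
```

Calling `chainWraps(INT32_MAX, -1, 1)` models "t1 = X + Y; t2 = t1 + Z" and `chainWraps(INT32_MAX, 1, -1)` models the reassociated order. Only the second one wraps, although both orders produce the same final value under wrapping arithmetic.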

lebedev.ri requested changes to this revision. Nov 8 2019, 12:26 PM

lebedev.ri added inline comments.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3730–3731	"But these flags are valid for Val: it is the (original) scalar instruction, which is still present in the original scalar loop, which serves as the remainder loop after vectorization. I want to preserve flags wherever they are valid; otherwise I would simply not preserve them in InnerLoopVectorizer::widenInstruction:"
I do not understand. How is the code in the current diff
  if (StripWrapFlags) {
    cast<Instruction>(Val)->setHasNoUnsignedWrap(false);
    cast<Instruction>(Val)->setHasNoSignedWrap(false);
  }
different from what I suggest:
  if (StripWrapFlags)
    cast<Instruction>(Val)->dropPoisonGeneratingFlags();
? Both drop NSW/NUW from `Val`.

This revision now requires changes to proceed. Nov 8 2019, 12:26 PM

dantrushin updated this revision to Diff 228573. Nov 9 2019, 7:04 AM

dantrushin marked an inline comment as done.

dantrushin added inline comments.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3730–3731	Sorry, I misunderstood your comment. I had the impression you still wanted me to hoist it out of the loop. I updated the diff.

In D69563#1739287, @hsaito wrote:

Are we essentially saying that any reassociation can't preserve NSW/NUW flags?

Say, X = MAX_INT, Y = -1, and Z = 1. "t1 = X + Y; t2 = t1 + Z" does not cause signed wrap. "t1 = X + Z; t2 = t1 + Y" does, right?

If we agree on that, then since vector reduction is a form of reassociation transformation, we need to drop NSW/NUW flags. We need to look at all other reassociation as well. Are we heading in that direction?

I'm not familiar enough with LLVM to answer your question. I hoped you would tell me.
But I would expect NSW/NUW flags to be correct, since LLVM performs transformations based on them (e.g., throws some subexpressions out).
I also think that (speaking of vectorization) vector reduction is a somewhat special/simpler case of reassociation, and in my example, not clearing wrap flags leads to an incorrect result even where
there cannot be any overflow issues in the scalar case.
For another compiler, I would simply clear NSW/NUW flags in InnerLoopVectorizer::widenInstruction() and recompute them afresh after vectorization,
but my understanding is that LLVM tries to preserve as much analysis as possible during transformations.
I need that bug fixed and I'll be grateful for any suggestions on how to fix it properly.

dantrushin marked an inline comment as done. Nov 9 2019, 7:33 AM

Ping.
@lebedev.ri : did I address your comments?

@hsaito: What do you think?

dantrushin marked an inline comment as done. Nov 21 2019, 6:14 AM

In D69563#1748967, @dantrushin wrote:

@hsaito: What do you think?

Checked with a few of my colleagues. They say:

SCEV drops a lot of no-wrap flags because of an inherent reassociation issue in its design.
InstCombine and scalar reassociate passes propagate the wrap flags if they can prove the flags are still valid.

So, dropping is the right thing to do. Thanks for raising the issue and fixing it.

This patch looks good to me. I agree it would be nicer if we could fix it during widenInstruction, but let's leave that fix to the VPlan-centric code generation, where we expect to (eventually) have the necessary context.

In D69563#1748967, @dantrushin wrote:

Ping.
@lebedev.ri : did I address your comments?

Yes. Though it still looks wrong to me that getOrCreateVectorValue() seems to return a value that has incorrect no-wrap flags that we need to later drop.

@hsaito: What do you think?

Thanks.
If we're done, could you approve it and commit it for me? I have no commit access yet :-/

Good catch, binary operations that perform reduction must indeed be vectorized w/o wrap flags.

But this should apply to all such operations that participate in the vectorized part of the loop. Note that (1) there may be several such add/sub instructions, as in llvm/test/Transforms/LoopVectorize/reduction.ll tests, and (2) the last instruction in the loop along a reduction chain may not be one of these binary wrapping ops, but may instead be e.g. a select or phi as in llvm/test/Transforms/LoopVectorize/if-reduction.ll tests. All these instructions should be identified either late as done here in fixReduction(), or early e.g. when they receive a (new) VPWidenRecipe with an additional indicator that they must not wrap.

Patch should be accompanied by test(s), e.g., derived from pr43828, and the fixing of existing tests.

Folding the partial sums in the middle block is already done w/o wrap flags, and the scalar loop used for leftover iterations and/or runtime guard default can continue to retain wrap flags.

In D69563#1763159, @Ayal wrote:

Good catch, binary operations that perform reduction must indeed be vectorized w/o wrap flags.

But this should apply to all such operations that participate in the vectorized part of the loop. Note that
(1) there may be several such add/sub instructions, as in llvm/test/Transforms/LoopVectorize/reduction.ll tests, and

Is there some existing API to find them all? Or do I need to invent my own?
Wouldn't it be easier just not to copy wrap flags in widenInstruction() for all instructions [which I was shy to do initially :) ], or is that too aggressive?

(2) the last instruction in the loop along a reduction chain may not be one of these binary wrapping ops, but may instead be e.g. a select or phi as in llvm/test/Transforms/LoopVectorize/if-reduction.ll tests.

But as far as I understand, neither select nor phi carries wrap flags? Or are you saying I need to look through them?

All these instructions should be identified either late as done here in fixReduction(), or early e.g. when they receive a (new) VPWidenRecipe with an additional indicator that they must not wrap.

Doing this early would make me build the reduction's instruction list in buildVPlanWithVPRecipes and, for each instruction in the list, pass a new flag to VPWidenRecipe and then (during execute()) to widenInstruction(),
again adding this new flag to it.
Is that correct?

Patch should be accompanied by test(s), e.g., derived from pr43828, and the fixing of existing tests.

Oops, I had them initially, but lost them during a review update. I'll add them back.

In D69563#1763331, @dantrushin wrote:

In D69563#1763159, @Ayal wrote:

Good catch, binary operations that perform reduction must indeed be vectorized w/o wrap flags.

But this should apply to all such operations that participate in the vectorized part of the loop. Note that
(1) there may be several such add/sub instructions, as in llvm/test/Transforms/LoopVectorize/reduction.ll tests, and

Is there some existing API to find them all? Or do I need to invent my own?

AFAIK such an API does not currently exist.

Wouldn't it be easier just not to copy wrap flags in widenInstruction() for all instructions [which I was shy to do initially :) ], or is that too aggressive?

Losing all wrap flags would be too aggressive.

(2) the last instruction in the loop along a reduction chain may not be one of these binary wrapping ops, but may instead be e.g. a select or phi as in llvm/test/Transforms/LoopVectorize/if-reduction.ll tests.

But as far as I understand, neither select nor phi carries wrap flags? Or are you saying I need to look through them?

Right, need to look through select and phi nodes, and possibly other (trunc/sext) instructions.

All these instructions should be identified either late as done here in fixReduction(), or early e.g. when they receive a (new) VPWidenRecipe with an additional indicator that they must not wrap.

Doing this early would make me build the reduction's instruction list in buildVPlanWithVPRecipes and, for each instruction in the list, pass a new flag to VPWidenRecipe and then (during execute()) to widenInstruction(),
again adding this new flag to it.
Is that correct?

Yes, that may be an alternative implementation.

Patch should be accompanied by test(s), e.g., derived from pr43828, and the fixing of existing tests.

Oops, I had them initially, but lost them during a review update. I'll add them back.

Update according to @Ayal's comments.

Collect all instructions comprising reduction and clear their
flags if appropriate.
Add lost test/test updates

Harbormaster completed remote builds in B41712: Diff 231691. Dec 2 2019, 6:10 AM

fhahn added inline comments. Dec 2 2019, 6:11 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
383	Formatting seems off. Could you run clang-format on the diff?

Removed debug leftovers...

Harbormaster completed remote builds in B41713: Diff 231692. Dec 2 2019, 6:12 AM

Fixed bad formatting

Harbormaster completed remote builds in B41714: Diff 231698. Dec 2 2019, 6:28 AM

dantrushin marked 2 inline comments as done. Dec 2 2019, 6:31 AM

dantrushin added inline comments.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
383	Thanks for catching this. Fixed

dantrushin added a reviewer: Ayal. Dec 2 2019, 6:32 AM

dantrushin marked an inline comment as done.

lebedev.ri resigned from this revision. Dec 7 2019, 5:01 AM

Add test(s) having multiple wrapping reduction instructions.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
382	End sentence with period.
383	collectReduInstructions >> collectReductionInstructions
394	Phi was initially inserted into Result, so the check `UI == Phi` is covered by the check `Result.insert(UI).second`. Can fold into `if (L->contains(UI->getParent()) && Result.insert(UI).second) Worklist.push_back(UI);`
3726	ReduList >> ReductionInstructions
3728	"might become" >> "are in general". Start sentence with capital (Wrap) and end with period.
3730	Better to interchange so that the loop over all Parts is inside. Can then potentially do `if (!isa<OverflowingBinaryOperator>(I)) continue;` if preferred.
llvm/test/Transforms/LoopVectorize/nuw.ll
3	Mention PR43828, possibly in file or test name. As interleaving is set to 2, check both sub's.
5	Tests that use x86 targets should be placed under LoopVectorize/X86 directory.
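
The folded worklist check suggested for line 394 above can be illustrated outside of LLVM. The sketch below uses a hypothetical `Inst` struct as a stand-in for `llvm::Instruction` (only users, loop membership and a single modeled wrap flag); `collectReductionInstructions` and `clearWrapFlags` are illustrative names and are not claimed to match the patch.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Hypothetical stand-in for llvm::Instruction: just enough state to model
// def-use edges, loop membership and one wrap flag.
struct Inst {
  bool inLoop = true;
  bool hasWrapFlags = false;
  std::vector<Inst *> users;
};

// Collects the phi and every instruction inside the loop that transitively
// uses it. Inserting into the set doubles as the "already visited" check, so
// the phi (reached again through the back edge) is never re-queued.
std::set<Inst *> collectReductionInstructions(Inst *phi) {
  std::set<Inst *> result{phi};
  std::vector<Inst *> worklist{phi};
  while (!worklist.empty()) {
    Inst *cur = worklist.back();
    worklist.pop_back();
    for (Inst *user : cur->users)
      if (user->inLoop && result.insert(user).second)
        worklist.push_back(user);
  }
  return result;
}

// Drops the modeled wrap flag from every collected instruction.
void clearWrapFlags(const std::set<Inst *> &insts) {
  for (Inst *inst : insts)
    inst->hasWrapFlags = false;
}
```

For a chain phi -> sub1 -> sub2 -> phi, where sub2 also feeds a phi outside the loop, the traversal collects the three in-loop instructions and skips the outside user.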

dantrushin marked 9 inline comments as done. Dec 9 2019, 8:29 AM

dantrushin added inline comments.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3730	I had to fuse loops before interchanging.

Addressed @Ayal's comments

Harbormaster completed remote builds in B42138: Diff 232869. Dec 9 2019, 8:33 AM

Probably not too important, because this should be handled by the vector predicated instructions/intrinsics, but still:

In D69563#1764483, @Ayal wrote:

In D69563#1763331, @dantrushin wrote:

In D69563#1763159, @Ayal wrote:

Good catch, binary operations that perform reduction must indeed be vectorized w/o wrap flags.

But this should apply to all such operations that participate in the vectorized part of the loop. Note that
(1) there may be several such add/sub instructions, as in llvm/test/Transforms/LoopVectorize/reduction.ll tests, and

Is there some existing API to find them all? Or do I need to invent my own?

AFAIK such an API does not currently exist.

Wouldn't it be easier just not to copy wrap flags in widenInstruction() for all instructions [which I was shy to do initially :) ], or is that too aggressive?

Losing all wrap flags would be too aggressive.

Why is it ok not to drop nuw here:

define i8 @function0(i8 %a) {
entry:
  br label %for.body

for.body:
  %indvars.iv = phi i32 [ 0, %entry ], [ %indvars.iv.next, %if.end ]
  %cmp5 = icmp ult i8 %a, 127
  br i1 %cmp5, label %if.then, label %if.end

if.then:
  %mul = mul nuw i8 %a, 2
  br label %if.end

if.end:
  %k = phi i8 [ %mul, %if.then ], [ %a, %for.body ]
  %indvars.iv.next = add i32 %indvars.iv, 1
  %cmp = icmp slt i32 %indvars.iv.next, 42
  br i1 %cmp, label %for.body, label %for.end

for.end:
  ret i8 undef
}

Vector code generated is

vector.ph:                                        ; preds = %entry
  %broadcast.splatinsert1 = insertelement <4 x i8> undef, i8 %a, i32 0
  %broadcast.splat2 = shufflevector <4 x i8> %broadcast.splatinsert1, <4 x i8> undef, <4 x i32> zeroinitializer
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
  %broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
  %induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
  %0 = add i32 %index, 0
  %1 = icmp ult <4 x i8> %broadcast.splat2, <i8 -128, i8 -128, i8 -128, i8 -128>
  %2 = mul nuw <4 x i8> %broadcast.splat2, <i8 2, i8 2, i8 2, i8 2>                                  ; if %a == 200, this is poison...
  %3 = xor <4 x i1> %1, <i1 true, i1 true, i1 true, i1 true>
  %predphi = select <4 x i1> %3, <4 x i8> %broadcast.splat2, <4 x i8> %2                     ; ... even though the %predphi == %a broadcasted, it's still poison as it depends on %2 (according to https://llvm.org/docs/LangRef.html#poisonvalues)
  %index.next = add i32 %index, 4
  %4 = icmp eq i32 %index.next, 40
  br i1 %4, label %middle.block, label %vector.body, !llvm.loop !0

Am I missing anything important here that allows us not to drop "nuw" flags?
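
The concern in this example can be stated as a small model. The sketch below is plain C++ rather than LLVM code, and its function names are hypothetical. It compares the guarded scalar multiply with the speculated multiply of the if-converted vector body, using the scalar guard `%a < 127` from the IR above. It only shows when the wrapping multiply is executed; whether the resulting poison reaches `%predphi` depends on the select semantics cited in the IR comment.

```cpp
#include <cassert>
#include <cstdint>

// True when a * b does not fit in 8 unsigned bits, i.e. when a `mul nuw i8`
// with these operands would produce poison.
bool mulWrapsUnsigned8(uint8_t a, uint8_t b) {
  return unsigned(a) * unsigned(b) > 0xFFu;
}

// Scalar loop body: the multiply runs only under the guard, so it can never
// wrap, whatever the input is.
bool scalarExecutesWrappingMul(uint8_t a) {
  if (a < 127)
    return mulWrapsUnsigned8(a, 2);
  return false;
}

// If-converted vector body: the multiply runs unconditionally and a select
// chooses between %a and the product afterwards.
bool vectorExecutesWrappingMul(uint8_t a) {
  return mulWrapsUnsigned8(a, 2);
}
```

For `%a == 200` the scalar model never executes a wrapping multiply, while the speculated one does, which matches the annotation on `%2` in the vector code above.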

Ping

Ayal added inline comments. Dec 17 2019, 12:21 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
394	This is indeed a general way to record all transitively dependent instructions inside a loop. In our case, though, there's a single known `LoopExitInst` with a single (loop-closed phi) user outside the loop. It is more efficient to record that user and check `if (UI != OutsideUser && Result.insert(UI).second)` than to repeatedly check whether the parent block belongs to L. Agreed?
3722	Worth restricting this wrap-dropping treatment to RecurrenceKinds that may wrap, namely RK_IntegerAdd and RK_IntegerMult.
3723	RedictionInstructions >> ReductionInstructions

Addressed comments.

Harbormaster completed remote builds in B42655: Diff 234277. Dec 17 2019, 5:04 AM

dantrushin marked 4 inline comments as done. Dec 17 2019, 5:05 AM

dantrushin added inline comments.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
394	IMHO this makes code uglier. See below

In D69563#1776137, @a.elovikov wrote:
Probably not too important, because this should be handled by the vector predicated instructions/intrinsics, but still:

You are right, I stand corrected, conditional instructions that are executed speculatively need to drop their wrap flags as well. Good catch!
This is reminiscent of conditional instructions that are "assume", which need to be dropped altogether, D68814.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
394	Perhaps offloading the code that searches for OutsideUser over to collectReductionInstructions() may look better, emphasizing that it's an implementation optimization internal to the collection process. That would require passing the loop again and also LoopExitInst. Another, orthogonal option may be to fuse the loop collecting reduction instructions with the one dropping the wraps, eliminating the need for a ReductionInstructions set. These options are more a matter of style, as you prefer. The current version and also the previous general one look good to me; we're not expecting too many UI instructions.
3720	The above comment belongs with the code below that fixes the vector-loop phi. Instead, a comment about fixing/dropping wraps can be made here; e.g., can move the "Wrap flags are in general..." comment here.
3732	Would be good to assert here that OuterUser is not nullptr, or even better that it's a (loop closed) phi.

Moved all reduction flag stuff into a single procedure (I had to make
it a member of InnerLoopVectorizer).

Reverted back to Loop->contains() as it seems cleanest to me, and
reductions do not usually contain more than a handful of instructions.

Harbormaster completed remote builds in B42732: Diff 234528. Dec 18 2019, 6:18 AM

dantrushin marked 4 inline comments as done. Dec 18 2019, 6:19 AM

dantrushin added inline comments.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
394	Indeed, doing this in a single loop is a good idea. But it needs to be a member method of InnerLoopVectorizer then.

No need for collectReductionInstructions() any more.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3875	There's no need to deal with LoopExitInstr in this implementation; it's probably clearer to start from the Phi as in the original version.
llvm/test/Transforms/LoopVectorize/nuw.ll
15	Note that `vector.body`'s are usually CHECK'd instead of CHECK-LABEL'd, just for consistency with other tests. Consider running update_test_checks.py to auto-generate the CHECKs.
42	ditto.

dantrushin marked 2 inline comments as done. Dec 18 2019, 1:15 PM

dantrushin added inline comments.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3875	I'm passing `RecurrenceDescriptor` to this function (so I can easily check the reduction type here). I can only get the loop exit instruction from it, not Phi. Here I need any reduction instruction to start with, and I don't see much difference between `Phi` and `LoopExitInstr`. Do you want me to pass in Phi instead and check the reduction type at the call site?

Ayal added inline comments. Dec 18 2019, 2:32 PM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3875	OK, right. Can pass both RdxDesc and Phi, but using LoopExitInstr is also fine. It is tempting though to then try and save redundant Loop->contains() calls, perhaps by restricting them only to users of LoopExitInstr, as in: `if ((Cur != LoopExitInstr || OrigLoop->contains(UI->getParent())) && Visited.insert(UI).second)` ?
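
The short-circuit suggested in the comment above can be sketched with a toy model that counts loop-membership queries. `Inst`, `LoopModel` and `collectFromExit` are hypothetical names, not LLVM types. The sketch relies on the assumption stated in the review that only the loop exit instruction has users outside the loop.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Hypothetical stand-in for llvm::Instruction.
struct Inst {
  bool inLoop = true;
  std::vector<Inst *> users;
};

// Hypothetical stand-in for llvm::Loop that counts membership queries.
struct LoopModel {
  mutable unsigned queries = 0;
  bool contains(const Inst *inst) const {
    ++queries;
    return inst->inLoop;
  }
};

// Walks the reduction chain starting from the loop exit instruction. The
// membership query is issued only for users of the exit instruction; users
// of every other instruction are assumed to be inside the loop.
std::set<Inst *> collectFromExit(Inst *loopExitInstr, const LoopModel &loop) {
  std::set<Inst *> visited{loopExitInstr};
  std::vector<Inst *> worklist{loopExitInstr};
  while (!worklist.empty()) {
    Inst *cur = worklist.back();
    worklist.pop_back();
    for (Inst *ui : cur->users)
      if ((cur != loopExitInstr || loop.contains(ui)) &&
          visited.insert(ui).second)
        worklist.push_back(ui);
  }
  return visited;
}
```

On a chain phi -> sub -> exit -> phi with one outside user of the exit instruction, this issues two membership queries (one per user of the exit instruction), where the unconditional check would issue one per visited use.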

Addressed comments:

removed unused function
updated tests
added short-circuit compare to LoopExitInstr

dantrushin marked 4 inline comments as done. Dec 19 2019, 3:55 AM

Harbormaster completed remote builds in B42777: Diff 234694. Dec 19 2019, 3:55 AM

This looks good to me, with a couple of final nits; thanks!

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
534	nit: end above sentence with period.
3875	nit: can shorten to `assert(LoopExitInstr && "null loop exit instruction");`

This revision is now accepted and ready to land. Dec 19 2019, 4:36 AM

Fixed final nits

@Ayal : Thanks! Could you commit it for me, as I have no commit access yet?

Harbormaster completed remote builds in B42778: Diff 234710. Dec 19 2019, 6:25 AM

Closed by commit rGe498be573871: [LV] Strip wrap flags from vectorized reductions (authored by Ayal). Dec 20 2019, 5:10 AM

This revision was automatically updated to reflect the committed changes.

In D69563#1791132, @dantrushin wrote:

@Ayal : Thanks! Could you commit it for me, as I have no commit access yet?

Done. There were a couple of missing fixes to AArch64/arbitrary-induction-step.ll that were added.

Revision Contents

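Path                                                                     Size
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp                          40 lines
llvm/test/Transforms/LoopVectorize/AArch64/arbitrary-induction-step.ll   18 lines
llvm/test/Transforms/LoopVectorize/X86/tail_loop_folding.ll              2 lines
llvm/test/Transforms/LoopVectorize/if-pred-stores.ll                     4 lines
llvm/test/Transforms/LoopVectorize/interleaved-accesses.ll               10 lines
llvm/test/Transforms/LoopVectorize/no_int_induction.ll                   4 lines
llvm/test/Transforms/LoopVectorize/nuw.ll                                58 lines
llvm/test/Transforms/LoopVectorize/reduction-small-size.ll               2 lines
llvm/test/Transforms/LoopVectorize/reduction.ll                          2 lines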

Diff 234864

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

This file is larger than 256 KB, so syntax highlighting is disabled by default.

Show First 20 Lines • Show All 373 Lines • ▼ Show 20 Lines	static Optional<unsigned> getSmallBestKnownTC(ScalarEvolution &SE, Loop *L) {

// Check if upper bound estimate is known.		// Check if upper bound estimate is known.
if (unsigned ExpectedTC = SE.getSmallConstantMaxTripCount(L))		if (unsigned ExpectedTC = SE.getSmallConstantMaxTripCount(L))
return ExpectedTC;		return ExpectedTC;

return None;		return None;
}		}

namespace llvm {		namespace llvm {
		AyalUnsubmitted Done Reply Inline Actions End sentence with period. Ayal: End sentence with period.

		fhahnUnsubmitted Done Reply Inline Actions formatting seems off. Could you run clang-format on the diff? fhahn: formatting seems off. Could you run clang-format on the diff?
		dantrushinAuthorUnsubmitted Done Reply Inline Actions Thanks for catching this. Fixed dantrushin: Thanks for catching this. Fixed
		AyalUnsubmitted Done Reply Inline Actions collectReduInstructions >> collectReductionInstructions Ayal: collectReduInstructions >> collectReductionInstructions
/// InnerLoopVectorizer vectorizes loops which contain only one basic		/// InnerLoopVectorizer vectorizes loops which contain only one basic
/// block to a specified vectorization factor (VF).		/// block to a specified vectorization factor (VF).
/// This class performs the widening of scalars into vectors, or multiple		/// This class performs the widening of scalars into vectors, or multiple
/// scalars. This class also implements the following features:		/// scalars. This class also implements the following features:
/// * It inserts an epilogue loop for handling loops that don't have iteration		/// * It inserts an epilogue loop for handling loops that don't have iteration
/// counts that are known to be a multiple of the vectorization factor.		/// counts that are known to be a multiple of the vectorization factor.
/// * It handles the code generation for reduction variables.		/// * It handles the code generation for reduction variables.
/// * Scalarization (implementation using scalars) of un-vectorizable		/// * Scalarization (implementation using scalars) of un-vectorizable
/// instructions.		/// instructions.
/// InnerLoopVectorizer does not perform any vectorization-legality		/// InnerLoopVectorizer does not perform any vectorization-legality
/// checks, and relies on the caller to check for the different legality		/// checks, and relies on the caller to check for the different legality
		AyalUnsubmitted Done Reply Inline Actions Phi was initially inserted into Result, so the check if UI==Phi is covered by the check if Result.insert(UI).second. Can fold into if (L->contains(UI->getParent()) && Result.insert(UI).second) Worklist.push_back(UI); Ayal: Phi was initially inserted into Result, so the check if UI==Phi is covered by the check if…
		AyalUnsubmitted Done Reply Inline Actions This is indeed a general way to record all transitively dependent instructions inside a loop. In our case, though, there's a single known `LoopExitInst` with a single (loop-closed phi) user outside the loop. More efficient to record that user and check if (UI != OutsideUser && Result.insert(UI).second) than to repeatedly check if parent block belongs to L. Agreed? Ayal: This is indeed a general way to record all transitively dependent instructions inside a loop.
		dantrushinAuthorUnsubmitted Done Reply Inline Actions IMHO this makes code uglier. See below dantrushin: IMHO this makes code uglier. See below
		AyalUnsubmitted Done Reply Inline Actions Perhaps offloading the code that searches for OutsideUser over to collectReductionInstructions() may look better, emphasizing that it's an implementation optimization internal to the collection process. That would require passing the loop again and also LoopExitInst. Another, orthogonal option may be to fuse the loop collecting reduction instructions with the one dropping the wraps, eliminating the need for a ReductionInstructions set. These options are more a matter of style, as you prefer. The current version and also the previous general one look good to me; we're not expecting too many UI instructions. Ayal: Perhaps offloading the code that searches for OutsideUser over to collectReductionInstructions…
		dantrushinAuthorUnsubmitted Done Reply Inline Actions Indeed, doing this in a single loop is a good idea. But it needs to be a member method of InnerLoopVectorizer then. dantrushin: Indeed, doing this in a single loop is a good idea. But it needs to be a member method of…
/// aspects. The InnerLoopVectorizer relies on the		/// aspects. The InnerLoopVectorizer relies on the
/// LoopVectorizationLegality class to provide information about the induction		/// LoopVectorizationLegality class to provide information about the induction
/// and reduction variables that were found to a given vectorization factor.		/// and reduction variables that were found to a given vectorization factor.
class InnerLoopVectorizer {		class InnerLoopVectorizer {
public:		public:
InnerLoopVectorizer(Loop *OrigLoop, PredicatedScalarEvolution &PSE,		InnerLoopVectorizer(Loop *OrigLoop, PredicatedScalarEvolution &PSE,
LoopInfo LI, DominatorTree DT,		LoopInfo LI, DominatorTree DT,
const TargetLibraryInfo *TLI,		const TargetLibraryInfo *TLI,
▲ Show 20 Lines • Show All 123 Lines • ▼ Show 20 Lines	protected:
/// Fix a first-order recurrence. This is the second phase of vectorizing		/// Fix a first-order recurrence. This is the second phase of vectorizing
/// this phi node.		/// this phi node.
void fixFirstOrderRecurrence(PHINode *Phi);		void fixFirstOrderRecurrence(PHINode *Phi);

/// Fix a reduction cross-iteration phi. This is the second phase of		/// Fix a reduction cross-iteration phi. This is the second phase of
/// vectorizing this phi node.		/// vectorizing this phi node.
void fixReduction(PHINode *Phi);		void fixReduction(PHINode *Phi);

		/// Clear NSW/NUW flags from reduction instructions if necessary.
		Ayal (Done): nit: end above sentence with period.
		void clearReductionWrapFlags(RecurrenceDescriptor &RdxDesc);

/// The Loop exit block may have single value PHI nodes with some		/// The Loop exit block may have single value PHI nodes with some
/// incoming value. While vectorizing we only handled real values		/// incoming value. While vectorizing we only handled real values
/// that were defined inside the loop and we should have one value for		/// that were defined inside the loop and we should have one value for
/// each predecessor of its parent basic block. See PR14725.		/// each predecessor of its parent basic block. See PR14725.
void fixLCSSAPHIs();		void fixLCSSAPHIs();

/// Iteratively sink the scalarized operands of a predicated instruction into		/// Iteratively sink the scalarized operands of a predicated instruction into
/// the block that was created for it.		/// the block that was created for it.
▲ Show 20 Lines • Show All 3,164 Lines • ▼ Show 20 Lines	if (VF == 1) {

// This vector is the Identity vector where the first element is the		// This vector is the Identity vector where the first element is the
// incoming scalar reduction.		// incoming scalar reduction.
VectorStart =		VectorStart =
Builder.CreateInsertElement(Identity, ReductionStartValue, Zero);		Builder.CreateInsertElement(Identity, ReductionStartValue, Zero);
}		}
}		}

		// Wrap flags are in general invalid after vectorization, clear them.
		clearReductionWrapFlags(RdxDesc);

// Fix the vector-loop phi.		// Fix the vector-loop phi.
		Ayal (Done): The above comment belongs with the code below that fixes the vector-loop phi. Instead, a comment about fixing/dropping wraps can be made here; e.g., can move the "Wrap flags are in general..." comment here.

// Reductions do not have to start at zero. They can start with		// Reductions do not have to start at zero. They can start with
		Ayal (Done): Worth restricting this wrap-dropping treatment to RecurrenceKinds that may wrap, namely RK_IntegerAdd and RK_IntegerMult.
// any loop invariant values.		// any loop invariant values.
		Ayal (Done): RedictionInstructions >> ReductionInstructions
BasicBlock *Latch = OrigLoop->getLoopLatch();		BasicBlock *Latch = OrigLoop->getLoopLatch();
Value *LoopVal = Phi->getIncomingValueForBlock(Latch);		Value *LoopVal = Phi->getIncomingValueForBlock(Latch);

		Ayal (Done): ReduList >> ReductionInstructions
for (unsigned Part = 0; Part < UF; ++Part) {		for (unsigned Part = 0; Part < UF; ++Part) {
Value *VecRdxPhi = getOrCreateVectorValue(Phi, Part);		Value *VecRdxPhi = getOrCreateVectorValue(Phi, Part);
		Ayal (Done): "might become" >> "are in general". Start the sentence with a capital (Wrap) and end with a period.
Value *Val = getOrCreateVectorValue(LoopVal, Part);		Value *Val = getOrCreateVectorValue(LoopVal, Part);
// Make sure to add the reduction start value only to the		// Make sure to add the reduction start value only to the
		Ayal (Done): Better interchange to loop over all Parts inside. Can then potentially do "if (!isa<OverflowingBinaryOperator>(I)) continue;" if preferred.
		dantrushin (Author, Done): I had to fuse loops before interchanging.
// first unroll part.		// first unroll part.
		lebedev.ri (Not Done): `if(auto* I = dyn_cast<Instruction>(LoopVal)) I->dropPoisonGeneratingFlags();` Shouldn't this be done outside of this loop though?
		dantrushin (Author, Done): No, I think I want them cleared from every interleaved part.
		lebedev.ri (Done): 1. Still, `if(StripWrapFlags) cast<Instruction>(Val)->dropPoisonGeneratingFlags();` 2. Isn't the actual bug in `getOrCreateVectorValue()`? Why does it return such erroneous instructions?
		dantrushin (Author, Done): 1. But these flags are valid for Val: it is the (original) scalar instruction, which is still present in the original scalar loop, which serves as the remainder loop after vectorization. I want to preserve flags wherever they are valid; otherwise I would simply not preserve them in `InnerLoopVectorizer::widenInstruction`: `Value *V = Builder.CreateNAryOp(I.getOpcode(), Ops); if (auto *VecOp = dyn_cast<Instruction>(V)) VecOp->copyIRFlags(&I);` 2. `getOrCreateVectorValue()` simply returns the cached vector value from `VectorLoopValueMap`. And this instruction was created by `InnerLoopVectorizer::widenInstruction()`, which lacks the necessary context. So we are basically back to the original question: if we do not want to preserve valid flags at all, we can simply not copy them in `widenInstruction()` (`VectorOp->copyIRFlags(&I, false)`). But if we do want to preserve flags where they're valid, we need context and should clear them only for reductions.
		lebedev.ri (Done): > But these flags are valid for Val -- it is (original) scalar instruction, which still present in original scalar loop, which serves as reminder loop after vectorization. I want to preserve flags whereever they are valid, otherwise I just would not preserve them in InnerLoopVectorizer::widenInstruction:
		I do not understand. How is the code in the current diff, `if (StripWrapFlags) { cast<Instruction>(Val)->setHasNoUnsignedWrap(false); cast<Instruction>(Val)->setHasNoSignedWrap(false); }`, different from what I suggest: `if(StripWrapFlags) cast<Instruction>(Val)->dropPoisonGeneratingFlags();`? Both drop NSW/NUW from `Val`.
		dantrushin (Author, Done): Sorry, I misunderstood your comment. I had the impression you still wanted me to hoist it out of the loop. I updated the diff.
Value *StartVal = (Part == 0) ? VectorStart : Identity;		Value *StartVal = (Part == 0) ? VectorStart : Identity;
		Ayal (Done): Would be good to assert here that OuterUser is not nullptr, or even better that it's a (loop closed) phi.
cast<PHINode>(VecRdxPhi)->addIncoming(StartVal, LoopVectorPreHeader);		cast<PHINode>(VecRdxPhi)->addIncoming(StartVal, LoopVectorPreHeader);
cast<PHINode>(VecRdxPhi)		cast<PHINode>(VecRdxPhi)
->addIncoming(Val, LI->getLoopFor(LoopVectorBody)->getLoopLatch());		->addIncoming(Val, LI->getLoopFor(LoopVectorBody)->getLoopLatch());
}		}

// Before each round, move the insertion point right between		// Before each round, move the insertion point right between
// the PHIs and the values we are going to write.		// the PHIs and the values we are going to write.
// This allows us to write both PHINodes and the extractelement		// This allows us to write both PHINodes and the extractelement
▲ Show 20 Lines • Show All 118 Lines • ▼ Show 20 Lines	int IncomingEdgeBlockIdx =
Phi->getBasicBlockIndex(OrigLoop->getLoopLatch());		Phi->getBasicBlockIndex(OrigLoop->getLoopLatch());
assert(IncomingEdgeBlockIdx >= 0 && "Invalid block index");		assert(IncomingEdgeBlockIdx >= 0 && "Invalid block index");
// Pick the other block.		// Pick the other block.
int SelfEdgeBlockIdx = (IncomingEdgeBlockIdx ? 0 : 1);		int SelfEdgeBlockIdx = (IncomingEdgeBlockIdx ? 0 : 1);
Phi->setIncomingValue(SelfEdgeBlockIdx, BCBlockPhi);		Phi->setIncomingValue(SelfEdgeBlockIdx, BCBlockPhi);
Phi->setIncomingValue(IncomingEdgeBlockIdx, LoopExitInst);		Phi->setIncomingValue(IncomingEdgeBlockIdx, LoopExitInst);
}		}

		void InnerLoopVectorizer::clearReductionWrapFlags(
		RecurrenceDescriptor &RdxDesc) {
		RecurrenceDescriptor::RecurrenceKind RK = RdxDesc.getRecurrenceKind();
		if (RK != RecurrenceDescriptor::RK_IntegerAdd &&
		RK != RecurrenceDescriptor::RK_IntegerMult)
		return;

		Instruction *LoopExitInstr = RdxDesc.getLoopExitInstr();
		assert(LoopExitInstr && "null loop exit instruction");
		Ayal (Done): There's no need to deal with LoopExitInstr in this implementation; it's probably clearer to start from the Phi as in the original version.
		dantrushin (Author, Done): I'm passing `RecurrenceDescriptor` to this function (so I can easily check the reduction type here). I can only get the loop exit instruction from it, not the Phi. Here I need any reduction instruction to start with, and I don't see much difference between `Phi` and `LoopExitInstr`. Do you want me to pass in the Phi instead and check the reduction type at the call site?
		Ayal (Done): OK, right. Can pass both RdxDesc and Phi, but using LoopExitInstr is also fine. It is tempting though to then try and save redundant Loop->contains() calls, perhaps by restricting them only to users of LoopExitInstr, as in: `if ((Cur != LoopExitInstr || OrigLoop->contains(UI->getParent())) && Visited.insert(UI).second)`?
		Ayal (Done): nit: can shorten to `assert(LoopExitInstr && "null loop exit instruction");`
		SmallVector<Instruction *, 8> Worklist;
		SmallPtrSet<Instruction *, 8> Visited;
		Worklist.push_back(LoopExitInstr);
		Visited.insert(LoopExitInstr);

		while (!Worklist.empty()) {
		Instruction *Cur = Worklist.pop_back_val();
		if (isa<OverflowingBinaryOperator>(Cur))
		for (unsigned Part = 0; Part < UF; ++Part) {
		Value *V = getOrCreateVectorValue(Cur, Part);
		cast<Instruction>(V)->dropPoisonGeneratingFlags();
		}

		for (User *U : Cur->users()) {
		Instruction *UI = cast<Instruction>(U);
		if ((Cur != LoopExitInstr || OrigLoop->contains(UI->getParent())) &&
		Visited.insert(UI).second)
		Worklist.push_back(UI);
		}
		}
		}
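
The traversal in clearReductionWrapFlags can be tried outside LLVM with a toy graph. The following is a minimal sketch, not the patch itself: `Node`, `HasWrapFlags`, `InLoop` and `clearWrapFlagsFrom` are hypothetical stand-ins for `Instruction`, the nuw/nsw flags, `OrigLoop->contains()` and the member function above. Unlike the real code, it clears the flag on every visited node instead of only on `OverflowingBinaryOperator`s, and it relies on the assumption discussed in the review that only users of the loop-exit instruction can live outside the loop.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Toy stand-in for an IR instruction; every name here is hypothetical.
struct Node {
  bool HasWrapFlags = false;  // models nuw/nsw being set
  bool InLoop = true;         // models OrigLoop->contains(...)
  std::vector<Node *> Users;  // models Instruction::users()
};

// Clears the flag on every node reachable from Exit through user edges,
// skipping users of Exit that are outside the loop.
// Returns the number of nodes visited.
inline std::size_t clearWrapFlagsFrom(Node *Exit) {
  std::vector<Node *> Worklist{Exit};
  std::unordered_set<Node *> Visited{Exit};

  while (!Worklist.empty()) {
    Node *Cur = Worklist.back();
    Worklist.pop_back();
    Cur->HasWrapFlags = false;

    for (Node *U : Cur->Users) {
      // The loop-membership test runs only for users of the exit node;
      // insert(...).second is true only the first time U is seen.
      if ((Cur != Exit || U->InLoop) && Visited.insert(U).second)
        Worklist.push_back(U);
    }
  }
  return Visited.size();
}
```

With an exit node whose users are the reduction phi (inside the loop) and an LCSSA phi (outside), the walk visits the exit node and the phi, follows the phi's back edge to the already-visited exit node, and never touches the LCSSA phi.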

void InnerLoopVectorizer::fixLCSSAPHIs() {		void InnerLoopVectorizer::fixLCSSAPHIs() {
for (PHINode &LCSSAPhi : LoopExitBlock->phis()) {		for (PHINode &LCSSAPhi : LoopExitBlock->phis()) {
if (LCSSAPhi.getNumIncomingValues() == 1) {		if (LCSSAPhi.getNumIncomingValues() == 1) {
auto *IncomingValue = LCSSAPhi.getIncomingValue(0);		auto *IncomingValue = LCSSAPhi.getIncomingValue(0);
// Non-instruction incoming values will have only one value.		// Non-instruction incoming values will have only one value.
unsigned LastLane = 0;		unsigned LastLane = 0;
if (isa<Instruction>(IncomingValue))		if (isa<Instruction>(IncomingValue))
LastLane = Cost->isUniformAfterVectorization(		LastLane = Cost->isUniformAfterVectorization(
▲ Show 20 Lines • Show All 4,087 Lines • Show Last 20 Lines
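
Why the flags stop being valid can be checked with plain unsigned arithmetic. This is a small sketch under stated assumptions: `subWrapsUnsigned` is a hypothetical helper, not an LLVM API, and the constants come from the example in the revision summary, where the scalar reduction starts at -104 while the extra lanes and interleaved parts start at the identity value 0.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: true when the 32-bit unsigned subtraction a - b wraps,
// which is the situation in which `sub nuw` would produce poison.
constexpr bool subWrapsUnsigned(std::uint32_t a, std::uint32_t b) {
  return b > a;
}

// Lane 0 of the first unroll part carries the scalar start value
// (-104 reinterpreted as an unsigned 32-bit integer).
constexpr std::uint32_t kScalarStart = static_cast<std::uint32_t>(-104);
// Every other lane, and every other unroll part, starts at the identity.
constexpr std::uint32_t kIdentity = 0;

// Subtracting 1 from the scalar start value does not wrap, so nuw can hold
// in the scalar loop.
static_assert(!subWrapsUnsigned(kScalarStart, 1), "scalar start must not wrap");
// Subtracting 1 from the identity wraps, so nuw copied onto the widened
// instruction would be wrong for those lanes.
static_assert(subWrapsUnsigned(kIdentity, 1), "0 - 1 wraps as unsigned");
```

The same subtraction that is fine for the original accumulator wraps as soon as the accumulator is split into partial sums that start from 0, which is why the flags are dropped only on the widened reduction instructions and left intact on the scalar loop.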

llvm/test/Transforms/LoopVectorize/AArch64/arbitrary-induction-step.ll

	Show All 9 Lines
	; sum += i * tmp;			; sum += i * tmp;
	; }			; }

	; CHECK-LABEL: @ind_plus2(			; CHECK-LABEL: @ind_plus2(
	; CHECK: load <4 x i32>, <4 x i32>*			; CHECK: load <4 x i32>, <4 x i32>*
	; CHECK: load <4 x i32>, <4 x i32>*			; CHECK: load <4 x i32>, <4 x i32>*
	; CHECK: mul nsw <4 x i32>			; CHECK: mul nsw <4 x i32>
	; CHECK: mul nsw <4 x i32>			; CHECK: mul nsw <4 x i32>
	; CHECK: add nsw <4 x i32>			; CHECK: add <4 x i32>
	; CHECK: add nsw <4 x i32>			; CHECK: add <4 x i32>
	; CHECK: %index.next = add i64 %index, 8			; CHECK: %index.next = add i64 %index, 8
	; CHECK: icmp eq i64 %index.next, 512			; CHECK: icmp eq i64 %index.next, 512

	; FORCE-VEC-LABEL: @ind_plus2(			; FORCE-VEC-LABEL: @ind_plus2(
	; FORCE-VEC: %wide.load = load <2 x i32>, <2 x i32>*			; FORCE-VEC: %wide.load = load <2 x i32>, <2 x i32>*
	; FORCE-VEC: mul nsw <2 x i32>			; FORCE-VEC: mul nsw <2 x i32>
	; FORCE-VEC: add nsw <2 x i32>			; FORCE-VEC: add <2 x i32>
	; FORCE-VEC: %index.next = add i64 %index, 2			; FORCE-VEC: %index.next = add i64 %index, 2
	; FORCE-VEC: icmp eq i64 %index.next, 512			; FORCE-VEC: icmp eq i64 %index.next, 512
	define i32 @ind_plus2(i32* %A) {			define i32 @ind_plus2(i32* %A) {
	entry:			entry:
	br label %for.body			br label %for.body

	for.body: ; preds = %entry, %for.body			for.body: ; preds = %entry, %for.body
	%A.addr = phi i32* [ %A, %entry ], [ %inc.ptr, %for.body ]			%A.addr = phi i32* [ %A, %entry ], [ %inc.ptr, %for.body ]
	Show All 19 Lines
	; sum += i * tmp;			; sum += i * tmp;
	; }			; }

	; CHECK-LABEL: @ind_minus2(			; CHECK-LABEL: @ind_minus2(
	; CHECK: load <4 x i32>, <4 x i32>*			; CHECK: load <4 x i32>, <4 x i32>*
	; CHECK: load <4 x i32>, <4 x i32>*			; CHECK: load <4 x i32>, <4 x i32>*
	; CHECK: mul nsw <4 x i32>			; CHECK: mul nsw <4 x i32>
	; CHECK: mul nsw <4 x i32>			; CHECK: mul nsw <4 x i32>
	; CHECK: add nsw <4 x i32>			; CHECK: add <4 x i32>
	; CHECK: add nsw <4 x i32>			; CHECK: add <4 x i32>
	; CHECK: %index.next = add i64 %index, 8			; CHECK: %index.next = add i64 %index, 8
	; CHECK: icmp eq i64 %index.next, 512			; CHECK: icmp eq i64 %index.next, 512

	; FORCE-VEC-LABEL: @ind_minus2(			; FORCE-VEC-LABEL: @ind_minus2(
	; FORCE-VEC: %wide.load = load <2 x i32>, <2 x i32>*			; FORCE-VEC: %wide.load = load <2 x i32>, <2 x i32>*
	; FORCE-VEC: mul nsw <2 x i32>			; FORCE-VEC: mul nsw <2 x i32>
	; FORCE-VEC: add nsw <2 x i32>			; FORCE-VEC: add <2 x i32>
	; FORCE-VEC: %index.next = add i64 %index, 2			; FORCE-VEC: %index.next = add i64 %index, 2
	; FORCE-VEC: icmp eq i64 %index.next, 512			; FORCE-VEC: icmp eq i64 %index.next, 512
	define i32 @ind_minus2(i32* %A) {			define i32 @ind_minus2(i32* %A) {
	entry:			entry:
	br label %for.body			br label %for.body

	for.body: ; preds = %entry, %for.body			for.body: ; preds = %entry, %for.body
	%A.addr = phi i32* [ %A, %entry ], [ %inc.ptr, %for.body ]			%A.addr = phi i32* [ %A, %entry ], [ %inc.ptr, %for.body ]
	Show All 26 Lines
	; CHECK: %[[V0:.*]] = load <8 x i32>			; CHECK: %[[V0:.*]] = load <8 x i32>
	; CHECK: %[[V1:.*]] = load <8 x i32>			; CHECK: %[[V1:.*]] = load <8 x i32>
	; CHECK: shufflevector <8 x i32> %[[V0]], <8 x i32> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>			; CHECK: shufflevector <8 x i32> %[[V0]], <8 x i32> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
	; CHECK: shufflevector <8 x i32> %[[V1]], <8 x i32> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>			; CHECK: shufflevector <8 x i32> %[[V1]], <8 x i32> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
	; CHECK: shufflevector <8 x i32> %[[V0]], <8 x i32> undef, <4 x i32> <i32 1, i32 3, i32 5, i32 7>			; CHECK: shufflevector <8 x i32> %[[V0]], <8 x i32> undef, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
	; CHECK: shufflevector <8 x i32> %[[V1]], <8 x i32> undef, <4 x i32> <i32 1, i32 3, i32 5, i32 7>			; CHECK: shufflevector <8 x i32> %[[V1]], <8 x i32> undef, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
	; CHECK: mul nsw <4 x i32>			; CHECK: mul nsw <4 x i32>
	; CHECK: mul nsw <4 x i32>			; CHECK: mul nsw <4 x i32>
	; CHECK: add nsw <4 x i32>			; CHECK: add <4 x i32>
	; CHECK: add nsw <4 x i32>			; CHECK: add <4 x i32>
	; CHECK: %index.next = add i64 %index, 8			; CHECK: %index.next = add i64 %index, 8
	; CHECK: icmp eq i64 %index.next, 1024			; CHECK: icmp eq i64 %index.next, 1024

	; FORCE-VEC-LABEL: @ptr_ind_plus2(			; FORCE-VEC-LABEL: @ptr_ind_plus2(
	; FORCE-VEC: %[[V:.*]] = load <4 x i32>			; FORCE-VEC: %[[V:.*]] = load <4 x i32>
	; FORCE-VEC: shufflevector <4 x i32> %[[V]], <4 x i32> undef, <2 x i32> <i32 0, i32 2>			; FORCE-VEC: shufflevector <4 x i32> %[[V]], <4 x i32> undef, <2 x i32> <i32 0, i32 2>
	; FORCE-VEC: shufflevector <4 x i32> %[[V]], <4 x i32> undef, <2 x i32> <i32 1, i32 3>			; FORCE-VEC: shufflevector <4 x i32> %[[V]], <4 x i32> undef, <2 x i32> <i32 1, i32 3>
	; FORCE-VEC: mul nsw <2 x i32>			; FORCE-VEC: mul nsw <2 x i32>
	; FORCE-VEC: add nsw <2 x i32>			; FORCE-VEC: add <2 x i32>
	; FORCE-VEC: %index.next = add i64 %index, 2			; FORCE-VEC: %index.next = add i64 %index, 2
	; FORCE-VEC: icmp eq i64 %index.next, 1024			; FORCE-VEC: icmp eq i64 %index.next, 1024
	define i32 @ptr_ind_plus2(i32* %A) {			define i32 @ptr_ind_plus2(i32* %A) {
	entry:			entry:
	br label %for.body			br label %for.body

	for.body: ; preds = %for.body, %entry			for.body: ; preds = %for.body, %entry
	%A.addr = phi i32* [ %A, %entry ], [ %inc.ptr1, %for.body ]			%A.addr = phi i32* [ %A, %entry ], [ %inc.ptr1, %for.body ]
	Show All 16 Lines

llvm/test/Transforms/LoopVectorize/X86/tail_loop_folding.ll

	Show First 20 Lines • Show All 90 Lines • ▼ Show 20 Lines
	; CHECK-LABEL: @reduction_i32(			; CHECK-LABEL: @reduction_i32(
	; CHECK: vector.body:			; CHECK: vector.body:
	; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %vector.ph ], [ [[INDEX_NEXT:%.*]], %vector.body ]			; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %vector.ph ], [ [[INDEX_NEXT:%.*]], %vector.body ]
	; CHECK-NEXT: [[ACCUM_PHI:%.*]] = phi <8 x i32> [ zeroinitializer, %vector.ph ], [ [[ACCUM:%.*]], %vector.body ]			; CHECK-NEXT: [[ACCUM_PHI:%.*]] = phi <8 x i32> [ zeroinitializer, %vector.ph ], [ [[ACCUM:%.*]], %vector.body ]
	; CHECK: [[ICMPULE:%.*]] = icmp ule <8 x i64>			; CHECK: [[ICMPULE:%.*]] = icmp ule <8 x i64>
	; CHECK: [[LOAD1:%.*]] = call <8 x i32> @llvm.masked.load.v8i32.p0v8i32(<8 x i32>* {{.*}}, i32 4, <8 x i1> [[ICMPULE]], <8 x i32> undef)			; CHECK: [[LOAD1:%.*]] = call <8 x i32> @llvm.masked.load.v8i32.p0v8i32(<8 x i32>* {{.*}}, i32 4, <8 x i1> [[ICMPULE]], <8 x i32> undef)
	; CHECK: [[LOAD2:%.*]] = call <8 x i32> @llvm.masked.load.v8i32.p0v8i32(<8 x i32>* {{.*}}, i32 4, <8 x i1> [[ICMPULE]], <8 x i32> undef)			; CHECK: [[LOAD2:%.*]] = call <8 x i32> @llvm.masked.load.v8i32.p0v8i32(<8 x i32>* {{.*}}, i32 4, <8 x i1> [[ICMPULE]], <8 x i32> undef)
	; CHECK-NEXT: [[ADD:%.*]] = add nsw <8 x i32> [[LOAD2]], [[LOAD1]]			; CHECK-NEXT: [[ADD:%.*]] = add nsw <8 x i32> [[LOAD2]], [[LOAD1]]
	; CHECK-NEXT: [[ACCUM]] = add nuw nsw <8 x i32> [[ADD]], [[ACCUM_PHI]]			; CHECK-NEXT: [[ACCUM]] = add <8 x i32> [[ADD]], [[ACCUM_PHI]]
	; CHECK: [[LIVEOUT:%.*]] = select <8 x i1> [[ICMPULE]], <8 x i32> [[ACCUM]], <8 x i32> [[ACCUM_PHI]]			; CHECK: [[LIVEOUT:%.*]] = select <8 x i1> [[ICMPULE]], <8 x i32> [[ACCUM]], <8 x i32> [[ACCUM_PHI]]
	; CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], 8			; CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], 8
	; CHECK: middle.block:			; CHECK: middle.block:
	; CHECK-NEXT: [[RDX_SHUF:%.*]] = shufflevector <8 x i32> [[LIVEOUT]], <8 x i32> undef, <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef>			; CHECK-NEXT: [[RDX_SHUF:%.*]] = shufflevector <8 x i32> [[LIVEOUT]], <8 x i32> undef, <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef>
	; CHECK-NEXT: [[BIN_RDX:%.*]] = add <8 x i32> [[LIVEOUT]], [[RDX_SHUF]]			; CHECK-NEXT: [[BIN_RDX:%.*]] = add <8 x i32> [[LIVEOUT]], [[RDX_SHUF]]
	; CHECK-NEXT: [[RDX_SHUF4:%.*]] = shufflevector <8 x i32> [[BIN_RDX]], <8 x i32> undef, <8 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			; CHECK-NEXT: [[RDX_SHUF4:%.*]] = shufflevector <8 x i32> [[BIN_RDX]], <8 x i32> undef, <8 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	; CHECK-NEXT: [[BIN_RDX5:%.*]] = add <8 x i32> [[BIN_RDX]], [[RDX_SHUF4]]			; CHECK-NEXT: [[BIN_RDX5:%.*]] = add <8 x i32> [[BIN_RDX]], [[RDX_SHUF4]]
	; CHECK-NEXT: [[RDX_SHUF6:%.*]] = shufflevector <8 x i32> [[BIN_RDX5]], <8 x i32> undef, <8 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			; CHECK-NEXT: [[RDX_SHUF6:%.*]] = shufflevector <8 x i32> [[BIN_RDX5]], <8 x i32> undef, <8 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	▲ Show 20 Lines • Show All 45 Lines • Show Last 20 Lines

llvm/test/Transforms/LoopVectorize/if-pred-stores.ll

	Show First 20 Lines • Show All 242 Lines • ▼ Show 20 Lines
	; UNROLL-NOSIMPLIFY-NEXT: store i32 2, i32* [[TMP0]], align 4			; UNROLL-NOSIMPLIFY-NEXT: store i32 2, i32* [[TMP0]], align 4
	; UNROLL-NOSIMPLIFY-NEXT: br label [[PRED_STORE_CONTINUE]]			; UNROLL-NOSIMPLIFY-NEXT: br label [[PRED_STORE_CONTINUE]]
	; UNROLL-NOSIMPLIFY: pred.store.continue:			; UNROLL-NOSIMPLIFY: pred.store.continue:
	; UNROLL-NOSIMPLIFY-NEXT: br i1 undef, label [[PRED_STORE_IF3:%.*]], label [[PRED_STORE_CONTINUE4]]			; UNROLL-NOSIMPLIFY-NEXT: br i1 undef, label [[PRED_STORE_IF3:%.*]], label [[PRED_STORE_CONTINUE4]]
	; UNROLL-NOSIMPLIFY: pred.store.if3:			; UNROLL-NOSIMPLIFY: pred.store.if3:
	; UNROLL-NOSIMPLIFY-NEXT: store i32 2, i32* [[TMP1]], align 4			; UNROLL-NOSIMPLIFY-NEXT: store i32 2, i32* [[TMP1]], align 4
	; UNROLL-NOSIMPLIFY-NEXT: br label [[PRED_STORE_CONTINUE4]]			; UNROLL-NOSIMPLIFY-NEXT: br label [[PRED_STORE_CONTINUE4]]
	; UNROLL-NOSIMPLIFY: pred.store.continue4:			; UNROLL-NOSIMPLIFY: pred.store.continue4:
	; UNROLL-NOSIMPLIFY-NEXT: [[TMP4:%.*]] = add nsw i32 [[VEC_PHI]], 1			; UNROLL-NOSIMPLIFY-NEXT: [[TMP4:%.*]] = add i32 [[VEC_PHI]], 1
	; UNROLL-NOSIMPLIFY-NEXT: [[TMP5:%.*]] = add nsw i32 [[VEC_PHI2]], 1			; UNROLL-NOSIMPLIFY-NEXT: [[TMP5:%.*]] = add i32 [[VEC_PHI2]], 1
	; UNROLL-NOSIMPLIFY-NEXT: [[PREDPHI]] = select i1 undef, i32 [[VEC_PHI]], i32 [[TMP4]]			; UNROLL-NOSIMPLIFY-NEXT: [[PREDPHI]] = select i1 undef, i32 [[VEC_PHI]], i32 [[TMP4]]
	; UNROLL-NOSIMPLIFY-NEXT: [[PREDPHI5]] = select i1 undef, i32 [[VEC_PHI2]], i32 [[TMP5]]			; UNROLL-NOSIMPLIFY-NEXT: [[PREDPHI5]] = select i1 undef, i32 [[VEC_PHI2]], i32 [[TMP5]]
	; UNROLL-NOSIMPLIFY-NEXT: [[OFFSET_IDX6:%.*]] = add i64 undef, [[INDEX]]			; UNROLL-NOSIMPLIFY-NEXT: [[OFFSET_IDX6:%.*]] = add i64 undef, [[INDEX]]
	; UNROLL-NOSIMPLIFY-NEXT: [[TMP6:%.*]] = trunc i64 [[OFFSET_IDX6]] to i32			; UNROLL-NOSIMPLIFY-NEXT: [[TMP6:%.*]] = trunc i64 [[OFFSET_IDX6]] to i32
	; UNROLL-NOSIMPLIFY-NEXT: [[INDUCTION7:%.*]] = add i32 [[TMP6]], 0			; UNROLL-NOSIMPLIFY-NEXT: [[INDUCTION7:%.*]] = add i32 [[TMP6]], 0
	; UNROLL-NOSIMPLIFY-NEXT: [[INDUCTION8:%.*]] = add i32 [[TMP6]], 1			; UNROLL-NOSIMPLIFY-NEXT: [[INDUCTION8:%.*]] = add i32 [[TMP6]], 1
	; UNROLL-NOSIMPLIFY-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], 2			; UNROLL-NOSIMPLIFY-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], 2
	; UNROLL-NOSIMPLIFY-NEXT: [[TMP7:%.*]] = icmp eq i64 [[INDEX_NEXT]], 0			; UNROLL-NOSIMPLIFY-NEXT: [[TMP7:%.*]] = icmp eq i64 [[INDEX_NEXT]], 0
	▲ Show 20 Lines • Show All 288 Lines • Show Last 20 Lines

llvm/test/Transforms/LoopVectorize/interleaved-accesses.ll

	Show First 20 Lines • Show All 133 Lines • ▼ Show 20 Lines
	; }			; }

	; CHECK-LABEL: @test_struct_load4(			; CHECK-LABEL: @test_struct_load4(
	; CHECK: %wide.vec = load <16 x i32>, <16 x i32>* {{.*}}, align 4			; CHECK: %wide.vec = load <16 x i32>, <16 x i32>* {{.*}}, align 4
	; CHECK: shufflevector <16 x i32> %wide.vec, <16 x i32> undef, <4 x i32> <i32 0, i32 4, i32 8, i32 12>			; CHECK: shufflevector <16 x i32> %wide.vec, <16 x i32> undef, <4 x i32> <i32 0, i32 4, i32 8, i32 12>
	; CHECK: shufflevector <16 x i32> %wide.vec, <16 x i32> undef, <4 x i32> <i32 1, i32 5, i32 9, i32 13>			; CHECK: shufflevector <16 x i32> %wide.vec, <16 x i32> undef, <4 x i32> <i32 1, i32 5, i32 9, i32 13>
	; CHECK: shufflevector <16 x i32> %wide.vec, <16 x i32> undef, <4 x i32> <i32 2, i32 6, i32 10, i32 14>			; CHECK: shufflevector <16 x i32> %wide.vec, <16 x i32> undef, <4 x i32> <i32 2, i32 6, i32 10, i32 14>
	; CHECK: shufflevector <16 x i32> %wide.vec, <16 x i32> undef, <4 x i32> <i32 3, i32 7, i32 11, i32 15>			; CHECK: shufflevector <16 x i32> %wide.vec, <16 x i32> undef, <4 x i32> <i32 3, i32 7, i32 11, i32 15>
	; CHECK: add nsw <4 x i32>			; CHECK: add <4 x i32>
	; CHECK: sub <4 x i32>			; CHECK: sub <4 x i32>
	; CHECK: add nsw <4 x i32>			; CHECK: add <4 x i32>
	; CHECK: sub <4 x i32>			; CHECK: sub <4 x i32>

	%struct.ST4 = type { i32, i32, i32, i32 }			%struct.ST4 = type { i32, i32, i32, i32 }

	define i32 @test_struct_load4(%struct.ST4* nocapture readonly %S) {			define i32 @test_struct_load4(%struct.ST4* nocapture readonly %S) {
	entry:			entry:
	br label %for.body			br label %for.body

	▲ Show 20 Lines • Show All 371 Lines • ▼ Show 20 Lines
	; SB = SumB;			; SB = SumB;
	; }			; }

	; CHECK-LABEL: @int_float_struct(			; CHECK-LABEL: @int_float_struct(
	; CHECK: %wide.vec = load <8 x i32>, <8 x i32>* %{{.*}}, align 4			; CHECK: %wide.vec = load <8 x i32>, <8 x i32>* %{{.*}}, align 4
	; CHECK: %[[V0:.*]] = shufflevector <8 x i32> %wide.vec, <8 x i32> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>			; CHECK: %[[V0:.*]] = shufflevector <8 x i32> %wide.vec, <8 x i32> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
	; CHECK: %[[V1:.*]] = shufflevector <8 x i32> %wide.vec, <8 x i32> undef, <4 x i32> <i32 1, i32 3, i32 5, i32 7>			; CHECK: %[[V1:.*]] = shufflevector <8 x i32> %wide.vec, <8 x i32> undef, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
	; CHECK: bitcast <4 x i32> %[[V1]] to <4 x float>			; CHECK: bitcast <4 x i32> %[[V1]] to <4 x float>
	; CHECK: add nsw <4 x i32>			; CHECK: add <4 x i32>
	; CHECK: fadd fast <4 x float>			; CHECK: fadd fast <4 x float>

	%struct.IntFloat = type { i32, float }			%struct.IntFloat = type { i32, float }

	@SA = common global i32 0, align 4			@SA = common global i32 0, align 4
	@SB = common global float 0.000000e+00, align 4			@SB = common global float 0.000000e+00, align 4

	define void @int_float_struct(%struct.IntFloat* nocapture readonly %A) #0 {			define void @int_float_struct(%struct.IntFloat* nocapture readonly %A) #0 {
	▲ Show 20 Lines • Show All 99 Lines • ▼ Show 20 Lines
	; CHECK: %[[X2:.+]] = extractelement <8 x i32> %[[L1:.+]], i32 2			; CHECK: %[[X2:.+]] = extractelement <8 x i32> %[[L1:.+]], i32 2
	; CHECK: store i32 %[[X2:.+]], {{.*}}			; CHECK: store i32 %[[X2:.+]], {{.*}}
	; CHECK: %[[X3:.+]] = extractelement <8 x i32> %[[L1:.+]], i32 4			; CHECK: %[[X3:.+]] = extractelement <8 x i32> %[[L1:.+]], i32 4
	; CHECK: store i32 %[[X3:.+]], {{.*}}			; CHECK: store i32 %[[X3:.+]], {{.*}}
	; CHECK: %[[X4:.+]] = extractelement <8 x i32> %[[L1:.+]], i32 6			; CHECK: %[[X4:.+]] = extractelement <8 x i32> %[[L1:.+]], i32 6
	; CHECK: store i32 %[[X4:.+]], {{.*}}			; CHECK: store i32 %[[X4:.+]], {{.*}}
	; CHECK: %[[L2:.+]] = load <8 x i32>, <8 x i32>* {{.*}}			; CHECK: %[[L2:.+]] = load <8 x i32>, <8 x i32>* {{.*}}
	; CHECK: %[[S1:.+]] = shufflevector <8 x i32> %[[L2]], <8 x i32> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>			; CHECK: %[[S1:.+]] = shufflevector <8 x i32> %[[L2]], <8 x i32> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
	; CHECK: add nsw <4 x i32> %[[S1]], %[[Phi]]			; CHECK: add <4 x i32> %[[S1]], %[[Phi]]

	define i32 @PR27626_1(%pair.i32 *%p, i64 %n) {			define i32 @PR27626_1(%pair.i32 *%p, i64 %n) {
	entry:			entry:
	br label %for.body			br label %for.body

	for.body:			for.body:
	%i = phi i64 [ %i.next, %for.body ], [ 0, %entry ]			%i = phi i64 [ %i.next, %for.body ], [ 0, %entry ]
	%s = phi i32 [ %2, %for.body ], [ 0, %entry ]			%s = phi i32 [ %2, %for.body ], [ 0, %entry ]
	▲ Show 20 Lines • Show All 84 Lines • ▼ Show 20 Lines
	; CHECK: %[[X2:.+]] = extractelement <8 x i32> %[[L1:.+]], i32 2			; CHECK: %[[X2:.+]] = extractelement <8 x i32> %[[L1:.+]], i32 2
	; CHECK: store i32 %[[X2:.+]], {{.*}}			; CHECK: store i32 %[[X2:.+]], {{.*}}
	; CHECK: %[[X3:.+]] = extractelement <8 x i32> %[[L1:.+]], i32 4			; CHECK: %[[X3:.+]] = extractelement <8 x i32> %[[L1:.+]], i32 4
	; CHECK: store i32 %[[X3:.+]], {{.*}}			; CHECK: store i32 %[[X3:.+]], {{.*}}
	; CHECK: %[[X4:.+]] = extractelement <8 x i32> %[[L1:.+]], i32 6			; CHECK: %[[X4:.+]] = extractelement <8 x i32> %[[L1:.+]], i32 6
	; CHECK: store i32 %[[X4:.+]], {{.*}}			; CHECK: store i32 %[[X4:.+]], {{.*}}
	; CHECK: %[[L2:.+]] = load <8 x i32>, <8 x i32>* {{.*}}			; CHECK: %[[L2:.+]] = load <8 x i32>, <8 x i32>* {{.*}}
	; CHECK: %[[S1:.+]] = shufflevector <8 x i32> %[[L2]], <8 x i32> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>			; CHECK: %[[S1:.+]] = shufflevector <8 x i32> %[[L2]], <8 x i32> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
	; CHECK: add nsw <4 x i32> %[[S1]], %[[Phi]]			; CHECK: add <4 x i32> %[[S1]], %[[Phi]]

	define i32 @PR27626_3(%pair.i32 *%p, i64 %n, i32 %z) {			define i32 @PR27626_3(%pair.i32 *%p, i64 %n, i32 %z) {
	entry:			entry:
	br label %for.body			br label %for.body

	for.body:			for.body:
	%i = phi i64 [ %i.next, %for.body ], [ 0, %entry ]			%i = phi i64 [ %i.next, %for.body ], [ 0, %entry ]
	%s = phi i32 [ %2, %for.body ], [ 0, %entry ]			%s = phi i32 [ %2, %for.body ], [ 0, %entry ]
	▲ Show 20 Lines • Show All 164 Lines • Show Last 20 Lines

llvm/test/Transforms/LoopVectorize/no_int_induction.ll

; RUN: opt < %s -loop-vectorize -force-vector-interleave=1 -force-vector-width=4 -dce -instcombine -S | FileCheck %s		; RUN: opt < %s -loop-vectorize -force-vector-interleave=1 -force-vector-width=4 -dce -instcombine -S | FileCheck %s

; int __attribute__((noinline)) sum_array(int *A, int n) {		; int __attribute__((noinline)) sum_array(int *A, int n) {
; return std::accumulate(A, A + n, 0);		; return std::accumulate(A, A + n, 0);
; }		; }

target datalayout = "e-p:64:64:64-p1:16:16:16-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-n8:16:32:64-S128"		target datalayout = "e-p:64:64:64-p1:16:16:16-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-n8:16:32:64-S128"

;CHECK-LABEL: @sum_array(		;CHECK-LABEL: @sum_array(
;CHECK: phi i64		;CHECK: phi i64
;CHECK: phi <4 x i32>		;CHECK: phi <4 x i32>
;CHECK: load <4 x i32>		;CHECK: load <4 x i32>
;CHECK: add nsw <4 x i32>		;CHECK: add <4 x i32>
;CHECK: ret i32		;CHECK: ret i32
define i32 @sum_array(i32* %A, i32 %n) nounwind uwtable readonly noinline ssp {		define i32 @sum_array(i32* %A, i32 %n) nounwind uwtable readonly noinline ssp {
%1 = sext i32 %n to i64		%1 = sext i32 %n to i64
%2 = getelementptr inbounds i32, i32* %A, i64 %1		%2 = getelementptr inbounds i32, i32* %A, i64 %1
%3 = icmp eq i32 %n, 0		%3 = icmp eq i32 %n, 0
br i1 %3, label %_ZSt10accumulateIPiiET0_T_S2_S1_.exit, label %.lr.ph.i		br i1 %3, label %_ZSt10accumulateIPiiET0_T_S2_S1_.exit, label %.lr.ph.i

.lr.ph.i: ; preds = %0, %.lr.ph.i		.lr.ph.i: ; preds = %0, %.lr.ph.i
Show All 10 Lines	_ZSt10accumulateIPiiET0_T_S2_S1_.exit: ; preds = %.lr.ph.i, %0
ret i32 %.01.lcssa.i		ret i32 %.01.lcssa.i
}		}

; Same, but use a pointer with a different size.		; Same, but use a pointer with a different size.
;CHECK-LABEL: @sum_array_as1(		;CHECK-LABEL: @sum_array_as1(
;CHECK: phi i16		;CHECK: phi i16
;CHECK: phi <4 x i32>		;CHECK: phi <4 x i32>
;CHECK: load <4 x i32>		;CHECK: load <4 x i32>
;CHECK: add nsw <4 x i32>		;CHECK: add <4 x i32>
;CHECK: ret i32		;CHECK: ret i32
define i32 @sum_array_as1(i32 addrspace(1)* %A, i32 %n) nounwind uwtable readonly noinline ssp {		define i32 @sum_array_as1(i32 addrspace(1)* %A, i32 %n) nounwind uwtable readonly noinline ssp {
%1 = sext i32 %n to i64		%1 = sext i32 %n to i64
%2 = getelementptr inbounds i32, i32 addrspace(1)* %A, i64 %1		%2 = getelementptr inbounds i32, i32 addrspace(1)* %A, i64 %1
%3 = icmp eq i32 %n, 0		%3 = icmp eq i32 %n, 0
br i1 %3, label %_ZSt10accumulateIPiiET0_T_S2_S1_.exit, label %.lr.ph.i		br i1 %3, label %_ZSt10accumulateIPiiET0_T_S2_S1_.exit, label %.lr.ph.i

.lr.ph.i: ; preds = %0, %.lr.ph.i		.lr.ph.i: ; preds = %0, %.lr.ph.i

llvm/test/Transforms/LoopVectorize/nuw.ll

This file was added.

				; RUN: opt %s -loop-vectorize -force-vector-interleave=2 -force-vector-width=4 -S | FileCheck %s

				; Fixes PR43828
				Ayal: Mention PR43828, possibly in file or test name. As interleaving is set to 2, check both sub's.

				define void @test(i32* %B) {
				Ayal: Tests that use x86 targets should be placed under LoopVectorize/X86 directory.
				; CHECK-LABEL: @test(
				; CHECK: vector.body:
				; CHECK-COUNT-2: sub <4 x i32>
				entry:
				br label %outer_loop

				outer_loop:
				%local_4 = phi i32 [ 2, %entry ], [ %4, %outer_tail]
				br label %inner_loop

				Ayal: Note that `vector.body`'s are usually CHECK'd instead of CHECK-LABEL'd, just for consistency with other tests. Consider running update_test_checks.py to auto-generate the CHECKs.
				inner_loop:
				%local_2 = phi i32 [ 0, %outer_loop ], [ %1, %inner_loop ]
				%local_3 = phi i32 [ -104, %outer_loop ], [ %0, %inner_loop ]
				%0 = sub nuw nsw i32 %local_3, %local_4
				%1 = add nuw nsw i32 %local_2, 1
				%2 = icmp ugt i32 %local_2, 126
				br i1 %2, label %outer_tail, label %inner_loop

				outer_tail:
				%3 = phi i32 [ %0, %inner_loop ]
				store atomic i32 %3, i32 * %B unordered, align 8
				%4 = add i32 %local_4, 1
				%5 = icmp slt i32 %4, 6
				br i1 %5, label %outer_loop, label %exit

				exit:
				ret void
				}

				define i32 @multi-instr(i32* noalias nocapture %A, i32* noalias nocapture %B, i32 %inc) {
				; CHECK-LABEL: @multi-instr(
				; CHECK: vector.body:
				; CHECK-COUNT-4: add <4 x i32>
				entry:
				br label %loop

				loop:
				Ayal: ditto.
				%iv = phi i32 [0, %entry], [%iv_inc, %loop]
				%redu = phi i32 [0, %entry], [%3, %loop]
				%gepa = getelementptr inbounds i32, i32* %A, i32 %iv
				%gepb = getelementptr inbounds i32, i32* %B, i32 %iv
				%0 = load i32, i32* %gepa
				%1 = load i32, i32* %gepb
				%2 = add nuw nsw i32 %redu, %0
				%3 = add nuw nsw i32 %2, %1
				%iv_inc = add nuw nsw i32 %iv, 1
				%4 = icmp ult i32 %iv_inc, 128
				br i1 %4, label %loop, label %exit

				exit:
				%lcssa = phi i32 [%3, %loop]
				ret i32 %lcssa
				}

llvm/test/Transforms/LoopVectorize/reduction-small-size.ll

	; RUN: opt < %s -force-vector-width=4 -force-vector-interleave=1 -loop-vectorize -S | FileCheck %s			; RUN: opt < %s -force-vector-width=4 -force-vector-interleave=1 -loop-vectorize -S | FileCheck %s

	target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"			target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"

	; CHECK-LABEL: @PR34687(			; CHECK-LABEL: @PR34687(
	; CHECK: vector.body:			; CHECK: vector.body:
	; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, %vector.ph ], [ [[INDEX_NEXT:%.*]], %[[LATCH:.*]] ]			; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, %vector.ph ], [ [[INDEX_NEXT:%.*]], %[[LATCH:.*]] ]
	; CHECK-NEXT: [[VEC_PHI:%.*]] = phi <4 x i32> [ zeroinitializer, %vector.ph ], [ [[TMP17:%.*]], %[[LATCH]] ]			; CHECK-NEXT: [[VEC_PHI:%.*]] = phi <4 x i32> [ zeroinitializer, %vector.ph ], [ [[TMP17:%.*]], %[[LATCH]] ]
	; CHECK: [[LATCH]]:			; CHECK: [[LATCH]]:
	; CHECK: [[TMP13:%.*]] = and <4 x i32> [[VEC_PHI]], <i32 255, i32 255, i32 255, i32 255>			; CHECK: [[TMP13:%.*]] = and <4 x i32> [[VEC_PHI]], <i32 255, i32 255, i32 255, i32 255>
	; CHECK-NEXT: [[TMP14:%.*]] = add nuw nsw <4 x i32> [[TMP13]], {{.*}}			; CHECK-NEXT: [[TMP14:%.*]] = add <4 x i32> [[TMP13]], {{.*}}
	; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 4			; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 4
	; CHECK: [[TMP16:%.*]] = trunc <4 x i32> [[TMP14]] to <4 x i8>			; CHECK: [[TMP16:%.*]] = trunc <4 x i32> [[TMP14]] to <4 x i8>
	; CHECK-NEXT: [[TMP17]] = zext <4 x i8> [[TMP16]] to <4 x i32>			; CHECK-NEXT: [[TMP17]] = zext <4 x i8> [[TMP16]] to <4 x i32>
	; CHECK-NEXT: br i1 {{.*}}, label %middle.block, label %vector.body			; CHECK-NEXT: br i1 {{.*}}, label %middle.block, label %vector.body
	;			;
	define i8 @PR34687(i1 %c, i32 %x, i32 %n) {			define i8 @PR34687(i1 %c, i32 %x, i32 %n) {
	entry:			entry:
	br label %for.body			br label %for.body

llvm/test/Transforms/LoopVectorize/reduction.ll

for.end: ; preds = %for.body, %entry
%x.0.lcssa = phi i32 [ 0, %entry ], [ %sub, %for.body ]		%x.0.lcssa = phi i32 [ 0, %entry ], [ %sub, %for.body ]
ret i32 %x.0.lcssa		ret i32 %x.0.lcssa
}		}


; In this test the reduction variable is on the LHS and we can vectorize it.		; In this test the reduction variable is on the LHS and we can vectorize it.
;CHECK-LABEL: @reduction_sub_lhs(		;CHECK-LABEL: @reduction_sub_lhs(
;CHECK: phi <4 x i32>		;CHECK: phi <4 x i32>
;CHECK: sub nsw <4 x i32>		;CHECK: sub <4 x i32>
;CHECK: ret i32		;CHECK: ret i32
define i32 @reduction_sub_lhs(i32 %n, i32* noalias nocapture %A) nounwind uwtable readonly {		define i32 @reduction_sub_lhs(i32 %n, i32* noalias nocapture %A) nounwind uwtable readonly {
entry:		entry:
%cmp4 = icmp sgt i32 %n, 0		%cmp4 = icmp sgt i32 %n, 0
br i1 %cmp4, label %for.body, label %for.end		br i1 %cmp4, label %for.body, label %for.end

for.body: ; preds = %entry, %for.body		for.body: ; preds = %entry, %for.body
%indvars.iv = phi i64 [ %indvars.iv.next, %for.body ], [ 0, %entry ]		%indvars.iv = phi i64 [ %indvars.iv.next, %for.body ], [ 0, %entry ]

Revision Contents

Diff 234864
