Page MenuHomePhabricator

[LV] Strip wrap flags from vectorized reductions
ClosedPublic

Authored by dantrushin on Oct 29 2019, 6:50 AM.

Details

Summary

Wrap flags set on scalar reductions may become invalid after vectorization
After vectorization with interleaving, vectorized reduction might look
like this:

%vec.phi  = phi <4 x i32> [ <i32 -104, i32 0, i32 0, i32 0>, %vector.ph ], [ %2, %vector.body ]
%vec.phi2 = phi <4 x i32> [ zeroinitializer, %vector.ph ], [ %3, %vector.body ]
%2 = sub nuw nsw <4 x i32> %vec.phi, %broadcast.splat
%3 = sub nuw nsw <4 x i32> %vec.phi2, %broadcast.splat

Note that nowrap flags are invalid (0 - x wraps) and must be reset even
though they were correct in scalar case. Otherwise, InstSimplify will
throw (0-X)<nsw> out as a no-op

Fixes PR43828

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
dantrushin added inline comments.Oct 29 2019, 7:19 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3730–3731

No, I think I want them cleared from every interleaved part

lebedev.ri added inline comments.Nov 4 2019, 12:41 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3730–3731
  1. Still,
if(StripWrapFlags)
  cast<Instruction>(Val)->dropPoisonGeneratingFlags();
  1. Isn't the actual bug is in getOrCreateVectorValue()? why it returns such errneous instructions?
dantrushin marked an inline comment as done.Nov 5 2019, 1:49 AM
dantrushin added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3730–3731
  1. But these flags are valid for Val -- it is (original) scalar instruction, which still present in original scalar loop, which serves as reminder loop after vectorization. I want to preserve flags whereever they are valid, otherwise I just would not preserve them in InnerLoopVectorizer::widenInstruction:
4202       Value *V = Builder.CreateNAryOp(I.getOpcode(), Ops);
4203      
4204       if (auto *VecOp = dyn_cast<Instruction>(V))
4205         VecOp->copyIRFlags(&I);
  1. getOrCreateVectorValue() simply returns cached vector value from VectorLoopValueMap. And this instruction was created by InnerLoopVectorized::widenInstruction(), which lacks necessary context. So we basically back to the original question - if we do not want to preserve valid flags at all, we can simply not copy them in widenInstruction() (VectorOp->copyIRFlags(&I, false) ). But if we do want to preserve flags where they're valid, we need a context and clear them only for reductions.

Are we essentially saying that any reassociation can't preserve NSW/NUW flags?

Say, X = MAX_INT, Y = -1, and Z = 1. "t1 = X + Y; t2 = t1 + Z" does not cause signed wrap. "t1 = X + Z; t2 = t1 + Y" does, right?

If we agree on that, since vector reduction is a form of reassociation transformation. we need to drop NSW/NUW flags. We need to look at all other reassociation as well. Are we heading to that direction?

lebedev.ri requested changes to this revision.Nov 8 2019, 12:26 PM
lebedev.ri added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3730–3731

But these flags are valid for Val -- it is (original) scalar instruction, which still present in original scalar loop, which serves as reminder loop after vectorization. I want to preserve flags whereever they are valid, otherwise I just would not preserve them in InnerLoopVectorizer::widenInstruction:

I do not understand.
How is the code in the current diff

if (StripWrapFlags) {
  cast<Instruction>(Val)->setHasNoUnsignedWrap(false);
  cast<Instruction>(Val)->setHasNoSignedWrap(false);
}

different from what i suggest:

if(StripWrapFlags)
  cast<Instruction>(Val)->dropPoisonGeneratingFlags();

?
Both drop NSW/NUW from Val.

This revision now requires changes to proceed.Nov 8 2019, 12:26 PM
dantrushin updated this revision to Diff 228573.Nov 9 2019, 7:04 AM
dantrushin marked an inline comment as done.
dantrushin added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3730–3731

Sorry, I misunderstood your comment. I had an impression you still want me to hoist it out of loop.
I updated diff

Are we essentially saying that any reassociation can't preserve NSW/NUW flags?

Say, X = MAX_INT, Y = -1, and Z = 1. "t1 = X + Y; t2 = t1 + Z" does not cause signed wrap. "t1 = X + Z; t2 = t1 + Y" does, right?

If we agree on that, since vector reduction is a form of reassociation transformation. we need to drop NSW/NUW flags. We need to look at all other reassociation as well. Are we heading to that direction?

I'm not familiar with LLVM enough to answer your question. I hoped you'll tell me.
But I would expect NSW/NUW flags to be correct, since LLVM performs transformation based on them (e.g., throws some subexpressions out)
I also think that (speaking of vectorization) vector reduction is somewhat special/simpler case of reassociation and in my example, not clearing wrap flags leads to incorrect result even in case where
there cannot be any overflow issues in scalar case.
For other compiler, I would simply clear NSW/NUW flags in InnerLoopVectorizer::widenInstruction() and recompute them afresh after vectorization,
but my understanding is that LLVM tries to preserve as much analysis as possible during transformations
I need that bug fixed and I'll be grateful for any suggestions how to fix it properly.

dantrushin marked an inline comment as done.Nov 9 2019, 7:33 AM

Ping.
@lebedev.ri : did I addressed your comments?

@hsaito: What do you think?

dantrushin marked an inline comment as done.Nov 21 2019, 6:14 AM

@hsaito: What do you think?

Checked with a few of my colleagues. They say

  1. SCEV drops a lot of no-wrap flags because of inherent reassociation issue in design
  2. InstCombine and scalar reassociate passes propagate the wrap[ flags if they can prove the flags are still valid.

So, dropping is the right thing to do. Thanks for raising the issue and fixing it.

This patch looks good to me. I agree it's nicer if we could fix it during widenInstruction, but let's leave that fix to the VPlan centric code generation, where we expect to (eventually) have the necessary context.

Ping.
@lebedev.ri : did I addressed your comments?

Yes. Though it still looks wrong to me that getOrCreateVectorValue() seems to return a value that has incorrect no-wrap flags that we need to later drop.

@hsaito: What do you think?

Thanks.
If we're done, could you approve it and commit it for me? I have no commit access yet :-/

Ayal added a subscriber: Ayal.Nov 28 2019, 8:03 AM

Good catch, binary operations that perform reduction must indeed be vectorized w/o wrap flags.

But this should apply to all such operations that participate in the vectorized part of the loop. Note that (1) there may be several such add/sub instructions, as in llvm/test/Transforms/LoopVectorize/reduction.ll tests, and (2) the last instruction in the loop along a reduction chain may not be one of these binary wrapping ops, but may instead be e.g. a select or phi as in llvm/test/Transforms/LoopVectorize/if-reduction.ll tests. All these instructions should be identified either late as done here in fixReduction(), or early e.g. when they receive a (new) VPWidenRecipe with a additional indicator that they must not wrap.

Patch should be accompanied by test(s), e.g., derived from pr43828, and the fixing of existing tests.

Folding the partial sums in the middle block is already done w/o wrap flags, and the scalar loop used for leftover iterations and/or runtime guard default can continue to retain wrap flags.

Good catch, binary operations that perform reduction must indeed be vectorized w/o wrap flags.

But this should apply to all such operations that participate in the vectorized part of the loop. Note that
(1) there may be several such add/sub instructions, as in llvm/test/Transforms/LoopVectorize/reduction.ll tests, and

Is there some existing API to find them all? Or I need to invite my own?
Would not it be easier just to not copy wrap flags in widenInstruction() for all instructions [which I was shy to do initially :) ] or it is too aggressive?

(2) the last instruction in the loop along a reduction chain may not be one of these binary wrapping ops, but may instead be e.g. a select or phi as in llvm/test/Transforms/LoopVectorize/if-reduction.ll tests.

But as far as I understand, neither select nor phi carry no wrap flags? Or you're saying I need to look through them?

All these instructions should be identified either late as done here in fixReduction(), or early e.g. when they receive a (new) VPWidenRecipe with a additional indicator that they must not wrap.

Doing this early would make me build reduction's instruction list in buildVPlanWithVPRecipies and for each instruction in list pass new flag to VPWidenRecipe and then (during execute()) to widenInstruction()
Again, adding this new flag to it.
Is it correct?

Patch should be accompanied by test(s), e.g., derived from pr43828, and the fixing of existing tests.

Oops, I had them initially, but lost during review update. I'll add them back.

Ayal added a comment.Dec 1 2019, 8:19 AM

Good catch, binary operations that perform reduction must indeed be vectorized w/o wrap flags.

But this should apply to all such operations that participate in the vectorized part of the loop. Note that
(1) there may be several such add/sub instructions, as in llvm/test/Transforms/LoopVectorize/reduction.ll tests, and

Is there some existing API to find them all? Or I need to invite my own?

AFAIK such an API does not currently exist.

Would not it be easier just to not copy wrap flags in widenInstruction() for all instructions [which I was shy to do initially :) ] or it is too aggressive?

Loosing all wrap flags would be too aggressive.

(2) the last instruction in the loop along a reduction chain may not be one of these binary wrapping ops, but may instead be e.g. a select or phi as in llvm/test/Transforms/LoopVectorize/if-reduction.ll tests.

But as far as I understand, neither select nor phi carry no wrap flags? Or you're saying I need to look through them?

Right, need to look through select and phi nodes, and possibly other (trunc/sext) instructions.

All these instructions should be identified either late as done here in fixReduction(), or early e.g. when they receive a (new) VPWidenRecipe with a additional indicator that they must not wrap.

Doing this early would make me build reduction's instruction list in buildVPlanWithVPRecipies and for each instruction in list pass new flag to VPWidenRecipe and then (during execute()) to widenInstruction()
Again, adding this new flag to it.
Is it correct?

Yes, that may be an alternative implementation.

Patch should be accompanied by test(s), e.g., derived from pr43828, and the fixing of existing tests.

Oops, I had them initially, but lost during review update. I'll add them back.

dantrushin updated this revision to Diff 231691.Dec 2 2019, 6:09 AM

Update according to @Ayal's comments.

Collect all instructions comprising reduction and clear their
flags if appropriate.
Add lost test/test updates

fhahn added inline comments.Dec 2 2019, 6:11 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
383

formatting seems off. Could you run clang-format on the diff?

dantrushin updated this revision to Diff 231692.Dec 2 2019, 6:12 AM

Removed debug leftovers...

dantrushin updated this revision to Diff 231698.Dec 2 2019, 6:28 AM

Fixed bad formatting

dantrushin marked 2 inline comments as done.Dec 2 2019, 6:31 AM
dantrushin added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
383

Thanks for catching this. Fixed

dantrushin marked an inline comment as done.
lebedev.ri resigned from this revision.Dec 7 2019, 5:01 AM
Ayal added a comment.Dec 8 2019, 11:44 AM

Add test(s) having multiple wrapping reduction instructions.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
382

End sentence with period.

383

collectReduInstructions >> collectReductionInstructions

394

Phi was initially inserted into Result, so the check if UI==Phi is covered by the check if Result.insert(UI).second. Can fold into

if (L->contains(UI->getParent()) && Result.insert(UI).second)
  Worklist.push_back(UI);
3726

ReduList >> ReductionInstructions

3728

might become >> are in general
Start sentence with capital (Wrap) and end with period.

3730

Better interchange to loop over all Parts inside.
Can then potentially do "if (!isa<OverflowingBinaryOperator>(I)) continue;" if preferred.

llvm/test/Transforms/LoopVectorize/nuw.ll
3

Mention PR43828, possibly in file or test name.

As interleaving is set to 2, check both sub's.

5

Tests that use x86 targets should be placed under LoopVectorize/X86 directory.

dantrushin marked 9 inline comments as done.Dec 9 2019, 8:29 AM
dantrushin added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3730

I had to fuse loops before interchanging.

dantrushin updated this revision to Diff 232869.Dec 9 2019, 8:29 AM
dantrushin marked an inline comment as done.

Addressed @Ayal's comments

Probably not too much important because should be handled by the vector predicated instructions/intrinsics, but still

Good catch, binary operations that perform reduction must indeed be vectorized w/o wrap flags.

But this should apply to all such operations that participate in the vectorized part of the loop. Note that
(1) there may be several such add/sub instructions, as in llvm/test/Transforms/LoopVectorize/reduction.ll tests, and

Is there some existing API to find them all? Or I need to invite my own?

AFAIK such an API does not currently exist.

Would not it be easier just to not copy wrap flags in widenInstruction() for all instructions [which I was shy to do initially :) ] or it is too aggressive?

Loosing all wrap flags would be too aggressive.

Why is it ok not to drop nuw here:

define i8 @function0(i8 %a) {
entry:
  br label %for.body

for.body:
  %indvars.iv = phi i32 [ 0, %entry ], [ %indvars.iv.next, %if.end ]
  %cmp5 = icmp ult i8 %a, 127
  br i1 %cmp5, label %if.then, label %if.end

if.then:
  %mul = mul nuw i8 %a, 2
  br label %if.end

if.end:
  %k = phi i8 [ %mul, %if.then ], [ %a, %for.body ]
  %indvars.iv.next = add i32 %indvars.iv, 1
  %cmp = icmp slt i32 %indvars.iv.next, 42
  br i1 %cmp, label %for.body, label %for.end

for.end:
  ret i8 undef
}

Vector code generated is

vector.ph:                                        ; preds = %entry
  %broadcast.splatinsert1 = insertelement <4 x i8> undef, i8 %a, i32 0
  %broadcast.splat2 = shufflevector <4 x i8> %broadcast.splatinsert1, <4 x i8> undef, <4 x i32> zeroinitializer
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
  %broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
  %induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
  %0 = add i32 %index, 0
  %1 = icmp ult <4 x i8> %broadcast.splat2, <i8 -128, i8 -128, i8 -128, i8 -128>
  %2 = mul nuw <4 x i8> %broadcast.splat2, <i8 2, i8 2, i8 2, i8 2>                                  ; if %a == 200, this is poison...
  %3 = xor <4 x i1> %1, <i1 true, i1 true, i1 true, i1 true>
  %predphi = select <4 x i1> %3, <4 x i8> %broadcast.splat2, <4 x i8> %2                     ; ... even though the %predphi == %a broadcasted, it's still poison as it depends on %2 (according to https://llvm.org/docs/LangRef.html#poisonvalues)
  %index.next = add i32 %index, 4
  %4 = icmp eq i32 %index.next, 40
  br i1 %4, label %middle.block, label %vector.body, !llvm.loop !0

Do I miss anything important here that allows us not to drop "nuw" flags?

Ayal added inline comments.Dec 17 2019, 12:21 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
394

This is indeed a general way to record all transitively dependent instructions inside a loop. In our case, though, there's a single known LoopExitInst with a single (loop-closed phi) user outside the loop. More efficient to record that user and check if (UI != OutsideUser && Result.insert(UI).second) than to repeatedly check if parent block belongs to L. Agreed?

3722

Worth restricting this wrap-dropping treatment to RecurrenceKind's that may wrap, namely RK_IntegerAdd and RK_IntegerMult.

3723

RedictionInstructions >> ReductionInstructions

Addressed comments.

dantrushin marked 4 inline comments as done.Dec 17 2019, 5:05 AM
dantrushin added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
394

IMHO this makes code uglier. See below

Ayal added a comment.Dec 17 2019, 11:56 PM

Probably not too much important because should be handled by the vector predicated instructions/intrinsics, but still

Good catch, binary operations that perform reduction must indeed be vectorized w/o wrap flags.

But this should apply to all such operations that participate in the vectorized part of the loop. Note that
(1) there may be several such add/sub instructions, as in llvm/test/Transforms/LoopVectorize/reduction.ll tests, and

Is there some existing API to find them all? Or I need to invite my own?

AFAIK such an API does not currently exist.

Would not it be easier just to not copy wrap flags in widenInstruction() for all instructions [which I was shy to do initially :) ] or it is too aggressive?

Loosing all wrap flags would be too aggressive.

Why is it ok not to drop nuw here:

define i8 @function0(i8 %a) {
entry:
  br label %for.body

for.body:
  %indvars.iv = phi i32 [ 0, %entry ], [ %indvars.iv.next, %if.end ]
  %cmp5 = icmp ult i8 %a, 127
  br i1 %cmp5, label %if.then, label %if.end

if.then:
  %mul = mul nuw i8 %a, 2
  br label %if.end

if.end:
  %k = phi i8 [ %mul, %if.then ], [ %a, %for.body ]
  %indvars.iv.next = add i32 %indvars.iv, 1
  %cmp = icmp slt i32 %indvars.iv.next, 42
  br i1 %cmp, label %for.body, label %for.end

for.end:
  ret i8 undef
}

Vector code generated is

vector.ph:                                        ; preds = %entry
  %broadcast.splatinsert1 = insertelement <4 x i8> undef, i8 %a, i32 0
  %broadcast.splat2 = shufflevector <4 x i8> %broadcast.splatinsert1, <4 x i8> undef, <4 x i32> zeroinitializer
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
  %broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
  %induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
  %0 = add i32 %index, 0
  %1 = icmp ult <4 x i8> %broadcast.splat2, <i8 -128, i8 -128, i8 -128, i8 -128>
  %2 = mul nuw <4 x i8> %broadcast.splat2, <i8 2, i8 2, i8 2, i8 2>                                  ; if %a == 200, this is poison...
  %3 = xor <4 x i1> %1, <i1 true, i1 true, i1 true, i1 true>
  %predphi = select <4 x i1> %3, <4 x i8> %broadcast.splat2, <4 x i8> %2                     ; ... even though the %predphi == %a broadcasted, it's still poison as it depends on %2 (according to https://llvm.org/docs/LangRef.html#poisonvalues)
  %index.next = add i32 %index, 4
  %4 = icmp eq i32 %index.next, 40
  br i1 %4, label %middle.block, label %vector.body, !llvm.loop !0

Do I miss anything important here that allows us not to drop "nuw" flags?

You are right, I stand corrected, conditional instructions that are executed speculatively need to drop their wrap flags as well. Good catch!
This is reminiscent of conditional instructions that are "assume", which need to be dropped altogether, D68814.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
394

Perhaps offloading the code that searches for OutsideUser over to collectReductionInstructions() may look better, emphasizing that it's an implementation optimization internal to the collection process. That would require passing the loop again and also LoopExitInst.

Another, orthogonal option may be to fuse the loop collecting reduction instructions with the one dropping the wraps, eliminating the need for a ReductionInstructions set.

These options are more a matter of style, as you prefer. The current version and also the previous general one look good to me; we're not expecting too many UI instructions.

3720

The above comment belongs with the code below that fixes the vector-loop phi.
Instead, a comment about fixing/dropping wraps can be made here; e.g., can move the "Wrap flags are in general..." comment here.

3732

Would be good to assert here that OuterUser is not nullptr, or even better that its a (loop closed) phi.

dantrushin marked an inline comment as done.

Moved all reduction flag stuff into single procedure (I had to make
it member of InnerLoopVectorizer).

Reverted back to Loop->contains() as it seems cleanest to me and
reductions do not contain more than handful instructions usually.

dantrushin marked 4 inline comments as done.Dec 18 2019, 6:19 AM
dantrushin added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
394

Indeed, doing this in a single loop is a good idea. But it needs to be a member method of InnerLoopVectorizer then.

Ayal added a comment.Dec 18 2019, 12:16 PM

No need for collectReductionInstructions() any more.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3875

There's no need to deal with LoopExitInstr in this implementation; it's probably clearer to start from the Phi as in the original version.

llvm/test/Transforms/LoopVectorize/nuw.ll
15

Note that vector.body's are usually CHECK'd instead of CHECK-LABEL'd, just for consistency with other tests.

Consider running update_test_checks.py to auto-generate the CHECKs.

42

ditto.

dantrushin marked 2 inline comments as done.Dec 18 2019, 1:15 PM
dantrushin added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3875

I'm passing RecurrenceDescriptor to this function
(so I can easily check reduction type here).
I only can get loop exit instruction from it, not Phi.
Here I need *any* reduction instruction to start with, and I don't see much difference between Phi and LoopExitInstr.

Do you want me to pass in Phi instead and check reduction type at call site?

Ayal added inline comments.Dec 18 2019, 2:32 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3875

OK, right. Can pass both RdxDesc and Phi, but using LoopExitInstr is also fine.

It is tempting though to then try and save redundant Loop->contains() calls, perhaps by restricting them only to users of LoopExitInstr, as in:

if ((Cur != LoopExitInstr || OrigLoop->contains(UI->getParent())) && Visited.insert(UI).second)

?

Addressed comments:

  • removed unsused function
  • updated tests
  • added short circuit compare to LoopExitInstr
dantrushin marked 4 inline comments as done.Dec 19 2019, 3:55 AM
Ayal accepted this revision.Dec 19 2019, 4:36 AM

This looks good to me, with couple of final nits; thanks!

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
534

nit: end above sentence with period.

3875

nit: can shorten to assert(LoopExitInstr && "null loop exit instruction");

This revision is now accepted and ready to land.Dec 19 2019, 4:36 AM
dantrushin marked 2 inline comments as done.

Fixed final nits

@Ayal : Thanks! Could you commit it for me, as I have no commit access yet?

This revision was automatically updated to reflect the committed changes.
Ayal added a comment.Dec 20 2019, 5:13 AM

@Ayal : Thanks! Could you commit it for me, as I have no commit access yet?

Done. There were a couple of missing fixes to AArch64/arbitrary-induction-step.ll that were added.