Page MenuHomePhabricator

[LV] Strip wrap flags from vectorized reductions
Needs ReviewPublic

Authored by dantrushin on Oct 29 2019, 6:50 AM.

Details

Summary

Wrap flags set on scalar reductions may become invalid after vectorization
After vectorization with interleaving, vectorized reduction might look
like this:

%vec.phi  = phi <4 x i32> [ <i32 -104, i32 0, i32 0, i32 0>, %vector.ph ], [ %2, %vector.body ]
%vec.phi2 = phi <4 x i32> [ zeroinitializer, %vector.ph ], [ %3, %vector.body ]
%2 = sub nuw nsw <4 x i32> %vec.phi, %broadcast.splat
%3 = sub nuw nsw <4 x i32> %vec.phi2, %broadcast.splat

Note that nowrap flags are invalid (0 - x wraps) and must be reset even
though they were correct in scalar case. Otherwise, InstSimplify will
throw (0-X)<nsw> out as a no-op

Fixes PR43828

Event Timeline

dantrushin created this revision.Oct 29 2019, 6:50 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 29 2019, 6:50 AM

I remember seeing similar patch for SLPVectorizer(?).
There the consensus was that the flags that are present on all instructions, can be preserved.
This isn't applicable here?

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3752–3753
if(auto* I = dyn_cast<Instruction>(LoopVal))
  I->dropPoisonGeneratingFlags();

Shouldn't this be done outside of this loop though?

I remember seeing similar patch for SLPVectorizer(?).
There the consensus was that the flags that are present on all instructions, can be preserved.
This isn't applicable here?

Reductions are special case, IMHO. As you can see, interleaving creates new vector IV {0,-,X}, which obviously wraps,
even though original scalar {-104,-,x} did not

So I believe we have bug here.
I need a fix for it so I had to create this review to get wheels rolling, because other means (asking on llvm-dev and filing PR) did not work ;-P

dantrushin marked an inline comment as done.Oct 29 2019, 7:19 AM
dantrushin added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3752–3753

No, I think I want them cleared from every interleaved part

lebedev.ri added inline comments.Nov 4 2019, 12:41 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3752–3753
  1. Still,
if(StripWrapFlags)
  cast<Instruction>(Val)->dropPoisonGeneratingFlags();
  1. Isn't the actual bug is in getOrCreateVectorValue()? why it returns such errneous instructions?
dantrushin marked an inline comment as done.Nov 5 2019, 1:49 AM
dantrushin added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3752–3753
  1. But these flags are valid for Val -- it is (original) scalar instruction, which still present in original scalar loop, which serves as reminder loop after vectorization. I want to preserve flags whereever they are valid, otherwise I just would not preserve them in InnerLoopVectorizer::widenInstruction:
4202       Value *V = Builder.CreateNAryOp(I.getOpcode(), Ops);
4203      
4204       if (auto *VecOp = dyn_cast<Instruction>(V))
4205         VecOp->copyIRFlags(&I);
  1. getOrCreateVectorValue() simply returns cached vector value from VectorLoopValueMap. And this instruction was created by InnerLoopVectorized::widenInstruction(), which lacks necessary context. So we basically back to the original question - if we do not want to preserve valid flags at all, we can simply not copy them in widenInstruction() (VectorOp->copyIRFlags(&I, false) ). But if we do want to preserve flags where they're valid, we need a context and clear them only for reductions.

Are we essentially saying that any reassociation can't preserve NSW/NUW flags?

Say, X = MAX_INT, Y = -1, and Z = 1. "t1 = X + Y; t2 = t1 + Z" does not cause signed wrap. "t1 = X + Z; t2 = t1 + Y" does, right?

If we agree on that, since vector reduction is a form of reassociation transformation. we need to drop NSW/NUW flags. We need to look at all other reassociation as well. Are we heading to that direction?

lebedev.ri requested changes to this revision.Nov 8 2019, 12:26 PM
lebedev.ri added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3752–3753

But these flags are valid for Val -- it is (original) scalar instruction, which still present in original scalar loop, which serves as reminder loop after vectorization. I want to preserve flags whereever they are valid, otherwise I just would not preserve them in InnerLoopVectorizer::widenInstruction:

I do not understand.
How is the code in the current diff

if (StripWrapFlags) {
  cast<Instruction>(Val)->setHasNoUnsignedWrap(false);
  cast<Instruction>(Val)->setHasNoSignedWrap(false);
}

different from what i suggest:

if(StripWrapFlags)
  cast<Instruction>(Val)->dropPoisonGeneratingFlags();

?
Both drop NSW/NUW from Val.

This revision now requires changes to proceed.Nov 8 2019, 12:26 PM
dantrushin updated this revision to Diff 228573.Nov 9 2019, 7:04 AM
dantrushin marked an inline comment as done.
dantrushin added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3752–3753

Sorry, I misunderstood your comment. I had an impression you still want me to hoist it out of loop.
I updated diff

Are we essentially saying that any reassociation can't preserve NSW/NUW flags?

Say, X = MAX_INT, Y = -1, and Z = 1. "t1 = X + Y; t2 = t1 + Z" does not cause signed wrap. "t1 = X + Z; t2 = t1 + Y" does, right?

If we agree on that, since vector reduction is a form of reassociation transformation. we need to drop NSW/NUW flags. We need to look at all other reassociation as well. Are we heading to that direction?

I'm not familiar with LLVM enough to answer your question. I hoped you'll tell me.
But I would expect NSW/NUW flags to be correct, since LLVM performs transformation based on them (e.g., throws some subexpressions out)
I also think that (speaking of vectorization) vector reduction is somewhat special/simpler case of reassociation and in my example, not clearing wrap flags leads to incorrect result even in case where
there cannot be any overflow issues in scalar case.
For other compiler, I would simply clear NSW/NUW flags in InnerLoopVectorizer::widenInstruction() and recompute them afresh after vectorization,
but my understanding is that LLVM tries to preserve as much analysis as possible during transformations
I need that bug fixed and I'll be grateful for any suggestions how to fix it properly.

dantrushin marked an inline comment as done.Nov 9 2019, 7:33 AM

Ping.
@lebedev.ri : did I addressed your comments?

@hsaito: What do you think?

dantrushin marked an inline comment as done.Thu, Nov 21, 6:14 AM

@hsaito: What do you think?

Checked with a few of my colleagues. They say

  1. SCEV drops a lot of no-wrap flags because of inherent reassociation issue in design
  2. InstCombine and scalar reassociate passes propagate the wrap[ flags if they can prove the flags are still valid.

So, dropping is the right thing to do. Thanks for raising the issue and fixing it.

This patch looks good to me. I agree it's nicer if we could fix it during widenInstruction, but let's leave that fix to the VPlan centric code generation, where we expect to (eventually) have the necessary context.

Ping.
@lebedev.ri : did I addressed your comments?

Yes. Though it still looks wrong to me that getOrCreateVectorValue() seems to return a value that has incorrect no-wrap flags that we need to later drop.

@hsaito: What do you think?

Thanks.
If we're done, could you approve it and commit it for me? I have no commit access yet :-/

Ayal added a subscriber: Ayal.Thu, Nov 28, 8:03 AM

Good catch, binary operations that perform reduction must indeed be vectorized w/o wrap flags.

But this should apply to all such operations that participate in the vectorized part of the loop. Note that (1) there may be several such add/sub instructions, as in llvm/test/Transforms/LoopVectorize/reduction.ll tests, and (2) the last instruction in the loop along a reduction chain may not be one of these binary wrapping ops, but may instead be e.g. a select or phi as in llvm/test/Transforms/LoopVectorize/if-reduction.ll tests. All these instructions should be identified either late as done here in fixReduction(), or early e.g. when they receive a (new) VPWidenRecipe with a additional indicator that they must not wrap.

Patch should be accompanied by test(s), e.g., derived from pr43828, and the fixing of existing tests.

Folding the partial sums in the middle block is already done w/o wrap flags, and the scalar loop used for leftover iterations and/or runtime guard default can continue to retain wrap flags.

Good catch, binary operations that perform reduction must indeed be vectorized w/o wrap flags.

But this should apply to all such operations that participate in the vectorized part of the loop. Note that
(1) there may be several such add/sub instructions, as in llvm/test/Transforms/LoopVectorize/reduction.ll tests, and

Is there some existing API to find them all? Or I need to invite my own?
Would not it be easier just to not copy wrap flags in widenInstruction() for all instructions [which I was shy to do initially :) ] or it is too aggressive?

(2) the last instruction in the loop along a reduction chain may not be one of these binary wrapping ops, but may instead be e.g. a select or phi as in llvm/test/Transforms/LoopVectorize/if-reduction.ll tests.

But as far as I understand, neither select nor phi carry no wrap flags? Or you're saying I need to look through them?

All these instructions should be identified either late as done here in fixReduction(), or early e.g. when they receive a (new) VPWidenRecipe with a additional indicator that they must not wrap.

Doing this early would make me build reduction's instruction list in buildVPlanWithVPRecipies and for each instruction in list pass new flag to VPWidenRecipe and then (during execute()) to widenInstruction()
Again, adding this new flag to it.
Is it correct?

Patch should be accompanied by test(s), e.g., derived from pr43828, and the fixing of existing tests.

Oops, I had them initially, but lost during review update. I'll add them back.

Ayal added a comment.Sun, Dec 1, 8:19 AM

Good catch, binary operations that perform reduction must indeed be vectorized w/o wrap flags.

But this should apply to all such operations that participate in the vectorized part of the loop. Note that
(1) there may be several such add/sub instructions, as in llvm/test/Transforms/LoopVectorize/reduction.ll tests, and

Is there some existing API to find them all? Or I need to invite my own?

AFAIK such an API does not currently exist.

Would not it be easier just to not copy wrap flags in widenInstruction() for all instructions [which I was shy to do initially :) ] or it is too aggressive?

Loosing all wrap flags would be too aggressive.

(2) the last instruction in the loop along a reduction chain may not be one of these binary wrapping ops, but may instead be e.g. a select or phi as in llvm/test/Transforms/LoopVectorize/if-reduction.ll tests.

But as far as I understand, neither select nor phi carry no wrap flags? Or you're saying I need to look through them?

Right, need to look through select and phi nodes, and possibly other (trunc/sext) instructions.

All these instructions should be identified either late as done here in fixReduction(), or early e.g. when they receive a (new) VPWidenRecipe with a additional indicator that they must not wrap.

Doing this early would make me build reduction's instruction list in buildVPlanWithVPRecipies and for each instruction in list pass new flag to VPWidenRecipe and then (during execute()) to widenInstruction()
Again, adding this new flag to it.
Is it correct?

Yes, that may be an alternative implementation.

Patch should be accompanied by test(s), e.g., derived from pr43828, and the fixing of existing tests.

Oops, I had them initially, but lost during review update. I'll add them back.

dantrushin updated this revision to Diff 231691.Mon, Dec 2, 6:09 AM

Update according to @Ayal's comments.

Collect all instructions comprising reduction and clear their
flags if appropriate.
Add lost test/test updates

fhahn added inline comments.Mon, Dec 2, 6:11 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
383

formatting seems off. Could you run clang-format on the diff?

dantrushin updated this revision to Diff 231692.Mon, Dec 2, 6:12 AM

Removed debug leftovers...

dantrushin updated this revision to Diff 231698.Mon, Dec 2, 6:28 AM

Fixed bad formatting

dantrushin marked 2 inline comments as done.Mon, Dec 2, 6:31 AM
dantrushin added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
383

Thanks for catching this. Fixed

dantrushin marked an inline comment as done.
lebedev.ri resigned from this revision.Sat, Dec 7, 5:01 AM
Ayal added a comment.Sun, Dec 8, 11:44 AM

Add test(s) having multiple wrapping reduction instructions.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
382

End sentence with period.

383

collectReduInstructions >> collectReductionInstructions

394

Phi was initially inserted into Result, so the check if UI==Phi is covered by the check if Result.insert(UI).second. Can fold into

if (L->contains(UI->getParent()) && Result.insert(UI).second)
  Worklist.push_back(UI);
3748

ReduList >> ReductionInstructions

3750

might become >> are in general
Start sentence with capital (Wrap) and end with period.

3752

Better interchange to loop over all Parts inside.
Can then potentially do "if (!isa<OverflowingBinaryOperator>(I)) continue;" if preferred.

llvm/test/Transforms/LoopVectorize/nuw.ll
3

Mention PR43828, possibly in file or test name.

As interleaving is set to 2, check both sub's.

5

Tests that use x86 targets should be placed under LoopVectorize/X86 directory.

dantrushin marked 9 inline comments as done.Mon, Dec 9, 8:29 AM
dantrushin added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3752

I had to fuse loops before interchanging.

dantrushin updated this revision to Diff 232869.Mon, Dec 9, 8:29 AM
dantrushin marked an inline comment as done.

Addressed @Ayal's comments

Probably not too much important because should be handled by the vector predicated instructions/intrinsics, but still

Good catch, binary operations that perform reduction must indeed be vectorized w/o wrap flags.

But this should apply to all such operations that participate in the vectorized part of the loop. Note that
(1) there may be several such add/sub instructions, as in llvm/test/Transforms/LoopVectorize/reduction.ll tests, and

Is there some existing API to find them all? Or I need to invite my own?

AFAIK such an API does not currently exist.

Would not it be easier just to not copy wrap flags in widenInstruction() for all instructions [which I was shy to do initially :) ] or it is too aggressive?

Loosing all wrap flags would be too aggressive.

Why is it ok not to drop nuw here:

define i8 @function0(i8 %a) {
entry:
  br label %for.body

for.body:
  %indvars.iv = phi i32 [ 0, %entry ], [ %indvars.iv.next, %if.end ]
  %cmp5 = icmp ult i8 %a, 127
  br i1 %cmp5, label %if.then, label %if.end

if.then:
  %mul = mul nuw i8 %a, 2
  br label %if.end

if.end:
  %k = phi i8 [ %mul, %if.then ], [ %a, %for.body ]
  %indvars.iv.next = add i32 %indvars.iv, 1
  %cmp = icmp slt i32 %indvars.iv.next, 42
  br i1 %cmp, label %for.body, label %for.end

for.end:
  ret i8 undef
}

Vector code generated is

vector.ph:                                        ; preds = %entry
  %broadcast.splatinsert1 = insertelement <4 x i8> undef, i8 %a, i32 0
  %broadcast.splat2 = shufflevector <4 x i8> %broadcast.splatinsert1, <4 x i8> undef, <4 x i32> zeroinitializer
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
  %broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
  %induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
  %0 = add i32 %index, 0
  %1 = icmp ult <4 x i8> %broadcast.splat2, <i8 -128, i8 -128, i8 -128, i8 -128>
  %2 = mul nuw <4 x i8> %broadcast.splat2, <i8 2, i8 2, i8 2, i8 2>                                  ; if %a == 200, this is poison...
  %3 = xor <4 x i1> %1, <i1 true, i1 true, i1 true, i1 true>
  %predphi = select <4 x i1> %3, <4 x i8> %broadcast.splat2, <4 x i8> %2                     ; ... even though the %predphi == %a broadcasted, it's still poison as it depends on %2 (according to https://llvm.org/docs/LangRef.html#poisonvalues)
  %index.next = add i32 %index, 4
  %4 = icmp eq i32 %index.next, 40
  br i1 %4, label %middle.block, label %vector.body, !llvm.loop !0

Do I miss anything important here that allows us not to drop "nuw" flags?