This is an archive of the discontinued LLVM Phabricator instance.

[LoopInterchange] Do not interchange when a reduction phi in all subloops of the outer loop is not recognizable
ClosedPublic

Authored by congzhe on Sep 29 2022, 5:46 PM.

Details

Reviewers
bmahjour
Meinersbur
Group Reviewers
Restricted Project
Commits
rG75b33d6bd518: [LoopInterchange] Check phis in all subloops
Summary

This is the bugfix to the miscompile mentioned in https://reviews.llvm.org/D132055#3814831.

The IR that reproduces the bug is added as the test case in this patch. The problem is a triply nested loop where loop interchange has interchanged the outermost and the middle loop, which it should not do: the innermost loop performs a reduction, and the reduction is expressed with intrinsics and store instructions instead of phi nodes, i.e., there is no reduction phi node in the middle loop header.

We should prevent interchange for such cases. Currently, however, the pass does interchange, because findInductionAndReductions() only checks phi nodes in InnerLoop and OuterLoop, which are the middle loop and the outermost loop in the problematic IR. The phi node in the innermost loop that corresponds to the reduction operation, i.e., %vec.phi = phi <4 x i32> [ %0, %for.cond4.preheader.i ], [ %16, %vector.body ], is not checked. We should check it and have findInductionAndReductions() return false, since that phi node %vec.phi is not something the pass can handle.

Instead of checking the phi nodes only in InnerLoop and OuterLoop, this patch checks all subloops of OuterLoop. This way we check the phi nodes not only in the outermost and the middle loop but also in the innermost loop, and we bail out of findInductionAndReductions() when checking the innermost loop, thus not interchanging the middle and the outermost loop.

Diff Detail

Event Timeline

congzhe created this revision.Sep 29 2022, 5:46 PM
Herald added a project: Restricted Project. · View Herald TranscriptSep 29 2022, 5:46 PM
Herald added a subscriber: hiraditya. · View Herald Transcript
congzhe requested review of this revision.Sep 29 2022, 5:46 PM
congzhe edited the summary of this revision. (Show Details)Sep 29 2022, 9:12 PM
congzhe added reviewers: bmahjour, Meinersbur, Restricted Project, uabelho.
congzhe set the repository for this revision to rG LLVM Github Monorepo.
congzhe retitled this revision from [LoopInterchange] Do not interchange when reduction in subloops is not recognizable to [LoopInterchange] Do not interchange when a reduction phi in all subloops of the outer loop is not recognizable.
uabelho resigned from this revision.Sep 29 2022, 11:32 PM

I've verified that the miscompile I saw disappears with this patch. Thanks!

I'll leave it to someone who actually knows this code to decide whether it's the right solution, though.

Meinersbur accepted this revision.Oct 18 2022, 11:08 PM

LGTM. Thank you.

The additional check makes sense.

llvm/lib/Transforms/Scalar/LoopInterchange.cpp
837

[nit] LLVM does not use Almost-Always-Auto.

llvm/test/Transforms/LoopInterchange/multilevel-partial-reduction.ll
2

The pass -basic-aa doesn't seem to exist anymore; this might be a recent change.

This revision is now accepted and ready to land.Oct 18 2022, 11:08 PM
congzhe updated this revision to Diff 472201.Oct 31 2022, 7:09 PM
congzhe removed a reviewer: uabelho.
congzhe added a subscriber: uabelho.

LGTM. Thank you.

The additional check makes sense.

Thank you Michael @Meinersbur for the review! I've updated the patch slightly to address your comments.

Hi Bardia @bmahjour, I remember you wanted to take a look at the problematic IR that results if we interchange the loopnest in multilevel-partial-reduction.ll (the test file in this patch). Here it is: if we interchange the outermost and the middle loop, the IR becomes the one shown below. As @uabelho said in https://reviews.llvm.org/D132055#3814831, before interchange we read %arrayidx14.promoted.i = load i32, ptr %arrayidx14.i, align 1, then do 512 rounds of the inner loop adding the read elements, and then store the accumulated sum with store i32 %18, ptr %arrayidx14.i, align 1. After interchange we load and add, but then we don't do the store; instead we do the load and execute the inner loop again, so we throw away the calculated addition and just read the same values again. I'd appreciate any further comments you have on it.

define i32 @test7() {
entry:
  br label %for.cond4.preheader.i.preheader

for.cond1.preheader.i.preheader:                  ; preds = %for.cond4.preheader.i
  br label %for.cond1.preheader.i

for.cond1.preheader.i:                            ; preds = %for.cond1.preheader.i.preheader, %for.inc19.i
  %i.011.i = phi i16 [ %inc20.i, %for.inc19.i ], [ 0, %for.cond1.preheader.i.preheader ]
  br label %for.cond4.preheader.i.split

for.cond4.preheader.i.preheader:                  ; preds = %entry
  br label %for.cond4.preheader.i

for.cond4.preheader.i:                            ; preds = %for.cond4.preheader.i.preheader, %middle.block
  %j.010.i = phi i16 [ %inc17.i, %middle.block ], [ 0, %for.cond4.preheader.i.preheader ]
  br label %for.cond1.preheader.i.preheader

for.cond4.preheader.i.split:                      ; preds = %for.cond1.preheader.i
  %arrayidx14.i = getelementptr inbounds [2 x [4 x i32]], ptr @c, i16 0, i16 %i.011.i, i16 %j.010.i
  %arrayidx14.promoted.i = load i32, ptr %arrayidx14.i, align 1
  %0 = insertelement <4 x i32> <i32 poison, i32 0, i32 0, i32 0>, i32 %arrayidx14.promoted.i, i64 0
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %for.cond4.preheader.i.split
  %index = phi i16 [ 0, %for.cond4.preheader.i.split ], [ %index.next, %vector.body ]
  %vec.phi = phi <4 x i32> [ %0, %for.cond4.preheader.i.split ], [ %16, %vector.body ]
  %1 = or i16 %index, 1
  %2 = or i16 %index, 2
  %3 = or i16 %index, 3
  %4 = getelementptr inbounds [512 x [4 x i32]], ptr @b, i16 0, i16 %index, i16 %j.010.i
  %5 = getelementptr inbounds [512 x [4 x i32]], ptr @b, i16 0, i16 %1, i16 %j.010.i
  %6 = getelementptr inbounds [512 x [4 x i32]], ptr @b, i16 0, i16 %2, i16 %j.010.i
  %7 = getelementptr inbounds [512 x [4 x i32]], ptr @b, i16 0, i16 %3, i16 %j.010.i
  %8 = load i32, ptr %4, align 1
  %9 = load i32, ptr %5, align 1
  %10 = load i32, ptr %6, align 1
  %11 = load i32, ptr %7, align 1
  %12 = insertelement <4 x i32> poison, i32 %8, i64 0
  %13 = insertelement <4 x i32> %12, i32 %9, i64 1
  %14 = insertelement <4 x i32> %13, i32 %10, i64 2
  %15 = insertelement <4 x i32> %14, i32 %11, i64 3
  %16 = add <4 x i32> %15, %vec.phi
  %index.next = add nuw i16 %index, 4
  %17 = icmp eq i16 %index.next, 512
  br i1 %17, label %for.inc19.i, label %vector.body

middle.block:                                     ; preds = %for.inc19.i
  %.lcssa.lcssa = phi <4 x i32> [ %.lcssa, %for.inc19.i ]
  %arrayidx14.i.lcssa = phi ptr [ %arrayidx14.i, %for.inc19.i ]
  %18 = tail call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> %.lcssa.lcssa)
  store i32 %18, ptr %arrayidx14.i.lcssa, align 1
  %inc17.i = add nuw nsw i16 %j.010.i, 1
  %exitcond12.not.i = icmp eq i16 %inc17.i, 4
  br i1 %exitcond12.not.i, label %test.exit, label %for.cond4.preheader.i

for.inc19.i:                                      ; preds = %vector.body
  %.lcssa = phi <4 x i32> [ %16, %vector.body ]
  %inc20.i = add nuw nsw i16 %i.011.i, 1
  %exitcond13.not.i = icmp eq i16 %inc20.i, 2
  br i1 %exitcond13.not.i, label %middle.block, label %for.cond1.preheader.i

test.exit:                                        ; preds = %middle.block
  %19 = load i32, ptr @c, align 1
  ret i32 %19
}

The current output from loop interchange is wrong, but it could be corrected if we move the vector.reduce and the store to the for.inc19.i block, i.e.:

middle.block:                                     ; preds = %for.inc19.i
  %inc17.i = add nuw nsw i16 %j.010.i, 1
  %exitcond12.not.i = icmp eq i16 %inc17.i, 4
  br i1 %exitcond12.not.i, label %test.exit, label %for.j

for.inc19.i:                                      ; preds = %vector.body
  %.lcssa = phi <4 x i32> [ %16, %vector.body ]
  %18 = tail call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> %.lcssa)
  store i32 %18, ptr %arrayidx14.i, align 1
  %inc20.i = add nuw nsw i16 %i.011.i, 1
  %exitcond13.not.i = icmp eq i16 %inc20.i, 2
  br i1 %exitcond13.not.i, label %middle.block, label %for.i

If we look at the middle.block in the input IR, it serves two purposes at the same time: 1) it acts as the j-loop latch, and 2) it acts as the epilogue of the vector.body loop (containing the vector.reduce and the store). LoopInterchange seems to want to keep middle.block as the j-loop latch in the interchanged IR, but it fails to separate the epilogue portion from the latch computation. Perhaps a better fix would be to migrate any intervening code that is not used for computing the latch from middle.block to the i-loop latch block.

Per the discussion at the LoopOptWG, I will land the patch to fix the miscompile for now, and will look into fully supporting such reduction scenarios as a next step.

This revision was landed with ongoing or failed builds.Nov 3 2022, 9:21 PM
This revision was automatically updated to reflect the committed changes.