This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
llvm/
-
lib/Transforms/Scalar/
-
Transforms/
-
Scalar/
1
LoopInterchange.cpp
-
test/Transforms/LoopInterchange/
-
Transforms/
-
LoopInterchange/
1
multilevel-partial-reduction.ll

Differential D134930

[LoopInterchange] Do not interchange when a reduction phi in all subloops of the outer loop is not recognizable
ClosedPublic

Authored by congzhe on Sep 29 2022, 5:46 PM.

Download Raw Diff

Details

Reviewers

bmahjour
Meinersbur

Group Reviewers

Restricted Project

Commits

rG75b33d6bd518: [LoopInterchange] Check phis in all subloops

Summary

This is the bugfix to the miscompile mentioned in https://reviews.llvm.org/D132055#3814831.

The IR that reproduced the bug is added as the test case in this patch. The problem is that this is a triply nested loop, and loop interchange has interchanged the outermost and the middle loop, which it should not do. This is because the innermost loop is doing reduction, and the reduction is performed with intrinsics and store instructions instead of phi nodes, i.e., there is not a reduction phi node in the middle loop header.

We should prevent interchange for such cases. Currently however, the pass does interchange because the findInductionAndReductions() function only checks phi nodes in InnerLoop and OuterLoop, which is the middle loop and the outermost loop in the problematic IR. The phi node in the innermost loop that corresponds to the reduction operation, i.e., %vec.phi = phi <4 x i32> [ %0, %for.cond4.preheader.i ], [ %16, %vector.body ], is not checked. We should check it and let findInductionAndReductions() return false since that phi node %vec.phi is not something that the pass can handle.

What this patch does is that, instead of checking the phi nodes only in InnerLoop and OuterLoop, we check all subloops of the OuterLoop. This way we not only check the phi nodes in the outermost and the middle loop, but also check the innermost loop as well. And we'll bail from findInductionAndReductions() when checking the innermost loop, thus not interchanging the middle and the outermost loop.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

congzhe created this revision.Sep 29 2022, 5:46 PM

Herald added a project: Restricted Project. · View Herald TranscriptSep 29 2022, 5:46 PM

Herald added a subscriber: hiraditya. · View Herald Transcript

congzhe requested review of this revision.Sep 29 2022, 5:46 PM

Herald added a project: Restricted Project. · View Herald TranscriptSep 29 2022, 5:46 PM

Herald added a subscriber: llvm-commits. · View Herald Transcript

Harbormaster completed remote builds in B189575: Diff 464116.Sep 29 2022, 7:17 PM

congzhe edited the summary of this revision. (Show Details)Sep 29 2022, 9:12 PM

congzhe added reviewers: bmahjour, Meinersbur, Restricted Project, uabelho.

congzhe set the repository for this revision to rG LLVM Github Monorepo.

congzhe retitled this revision from [LoopInterchange] Do not interchange when reduction in subloops is not recognizable to [LoopInterchange] Do not interchange when a reduction phi in all subloops of the outer loop is not recognizable.

congzhe mentioned this in D132055: [LoopInterchange][PR57148] Ensure LCSSA form after loop interchnange.

I've verified that the miscompile I saw disappears with this patch. Thanks!

I'll leave for someone who actually knows this code to decide if it's the right solution though.

LGTM. Thank you.

The additional check make sense.

llvm/lib/Transforms/Scalar/LoopInterchange.cpp
837	[nit] LLVM does not use Almost-Always-Auto,
llvm/test/Transforms/LoopInterchange/multilevel-partial-reduction.ll
2	The the pass `-basic-aa` doesn't seem to exist anymore, might be a recent change.

This revision is now accepted and ready to land.Oct 18 2022, 11:08 PM

In D134930#3867354, @Meinersbur wrote:

LGTM. Thank you.

The additional check make sense.

Thank you Michael @Meinersbur for the review! I've updated the patch slightly to address your comments.

Hi Bardia @bmahjour, I remember you wanted to take a look at the problematic IR if we interchange the loopnest in multilevel-partial-reduction.ll (the test file in this patch). Here it is - if we interchange the outermost and the middle loop, after interchange the IR becomes the one as shown below. Like @uabelho said in https://reviews.llvm.org/D132055#3814831, before interchange we read arrayidx14.promoted.i = load i32, ptr %arrayidx14.i, align 1, then do 512 rounds of the inner loop and add the read elements, and then we store what we have so far in store i32 %18, ptr %arrayidx14.i, align 1: But after interchange we load, then add, but then we dont do the store, but do the load and execute the inner loop again, so we throw away the calculated addition and just read the same values again. I'd appreciate it if you have further comments on it.

define i32 @test7() {
entry:
  br label %for.cond4.preheader.i.preheader

for.cond1.preheader.i.preheader:                  ; preds = %for.cond4.preheader.i
  br label %for.cond1.preheader.i

for.cond1.preheader.i:                            ; preds = %for.cond1.preheader.i.preheader, %for.inc19.i
  %i.011.i = phi i16 [ %inc20.i, %for.inc19.i ], [ 0, %for.cond1.preheader.i.preheader ]
  br label %for.cond4.preheader.i.split

for.cond4.preheader.i.preheader:                  ; preds = %entry
  br label %for.cond4.preheader.i

for.cond4.preheader.i:                            ; preds = %for.cond4.preheader.i.preheader, %middle.block
  %j.010.i = phi i16 [ %inc17.i, %middle.block ], [ 0, %for.cond4.preheader.i.preheader ]
  br label %for.cond1.preheader.i.preheader

for.cond4.preheader.i.split:                      ; preds = %for.cond1.preheader.i
  %arrayidx14.i = getelementptr inbounds [2 x [4 x i32]], ptr @c, i16 0, i16 %i.011.i, i16 %j.010.i
  %arrayidx14.promoted.i = load i32, ptr %arrayidx14.i, align 1
  %0 = insertelement <4 x i32> <i32 poison, i32 0, i32 0, i32 0>, i32 %arrayidx14.promoted.i, i64 0
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %for.cond4.preheader.i.split
  %index = phi i16 [ 0, %for.cond4.preheader.i.split ], [ %index.next, %vector.body ]
  %vec.phi = phi <4 x i32> [ %0, %for.cond4.preheader.i.split ], [ %16, %vector.body ]
  %1 = or i16 %index, 1
  %2 = or i16 %index, 2
  %3 = or i16 %index, 3
  %4 = getelementptr inbounds [512 x [4 x i32]], ptr @b, i16 0, i16 %index, i16 %j.010.i
  %5 = getelementptr inbounds [512 x [4 x i32]], ptr @b, i16 0, i16 %1, i16 %j.010.i
  %6 = getelementptr inbounds [512 x [4 x i32]], ptr @b, i16 0, i16 %2, i16 %j.010.i
  %7 = getelementptr inbounds [512 x [4 x i32]], ptr @b, i16 0, i16 %3, i16 %j.010.i
  %8 = load i32, ptr %4, align 1
  %9 = load i32, ptr %5, align 1
  %10 = load i32, ptr %6, align 1
  %11 = load i32, ptr %7, align 1
  %12 = insertelement <4 x i32> poison, i32 %8, i64 0
  %13 = insertelement <4 x i32> %12, i32 %9, i64 1
  %14 = insertelement <4 x i32> %13, i32 %10, i64 2
  %15 = insertelement <4 x i32> %14, i32 %11, i64 3
  %16 = add <4 x i32> %15, %vec.phi
  %index.next = add nuw i16 %index, 4
  %17 = icmp eq i16 %index.next, 512
  br i1 %17, label %for.inc19.i, label %vector.body

middle.block:                                     ; preds = %for.inc19.i
  %.lcssa.lcssa = phi <4 x i32> [ %.lcssa, %for.inc19.i ]
  %arrayidx14.i.lcssa = phi ptr [ %arrayidx14.i, %for.inc19.i ]
  %18 = tail call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> %.lcssa.lcssa)
  store i32 %18, ptr %arrayidx14.i.lcssa, align 1
  %inc17.i = add nuw nsw i16 %j.010.i, 1
  %exitcond12.not.i = icmp eq i16 %inc17.i, 4
  br i1 %exitcond12.not.i, label %test.exit, label %for.cond4.preheader.i

for.inc19.i:                                      ; preds = %vector.body
  %.lcssa = phi <4 x i32> [ %16, %vector.body ]
  %inc20.i = add nuw nsw i16 %i.011.i, 1
  %exitcond13.not.i = icmp eq i16 %inc20.i, 2
  br i1 %exitcond13.not.i, label %middle.block, label %for.cond1.preheader.i

test.exit:                                        ; preds = %middle.block
  %19 = load i32, ptr @c, align 1
  ret i32 %19
}

Harbormaster completed remote builds in B195406: Diff 472201.Oct 31 2022, 7:58 PM

The current output from loop interchange is wrong, but it could be corrected if we move the vector.reduce and the store to the for.inc19.i block; ie:

middle.block:                                     ; preds = %for.inc19.i
  %inc17.i = add nuw nsw i16 %j.010.i, 1
  %exitcond12.not.i = icmp eq i16 %inc17.i, 4
  br i1 %exitcond12.not.i, label %test.exit, label %for.j

for.inc19.i:                                      ; preds = %vector.body
  %.lcssa = phi <4 x i32> [ %16, %vector.body ]
  %18 = tail call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> %.lcssa)
  store i32 %18, ptr %arrayidx14.i, align 1
  %inc20.i = add nuw nsw i16 %i.011.i, 1
  %exitcond13.not.i = icmp eq i16 %inc20.i, 2
  br i1 %exitcond13.not.i, label %middle.block, label %for.i

If we look at the middle.block in the input IR, it is serving two purposes at the same time: 1) it acts as the j-loop latch 2) and it acts as the epilogue of the vector.body loop (containing vector.reduce and store). LoopInterchange seems to want to keep middle.block as the j-loop latch in the interchanged IR, but it fails to separate the epilogue portion from the latch computation. Perhaps a better fix would be to migrate any intervening code that is not used for computing the latch from the middle.block to the i-loop latch block.

According to the discussion at the LoopOptWG, I will land the patch to fix the miscompile for now. Will look into how to fully support such reduction scenarios as the next steps.

This revision was landed with ongoing or failed builds.Nov 3 2022, 9:21 PM

Closed by commit rG75b33d6bd518: [LoopInterchange] Check phis in all subloops (authored by congzhe). · Explain Why

This revision was automatically updated to reflect the committed changes.

congzhe added a commit: rG75b33d6bd518: [LoopInterchange] Check phis in all subloops.

Revision Contents

Path

Size

llvm/

lib/

Transforms/

Scalar/

LoopInterchange.cpp

32 lines

test/

Transforms/

LoopInterchange/

multilevel-partial-reduction.ll

77 lines

Diff 473118

llvm/lib/Transforms/Scalar/LoopInterchange.cpp

Show First 20 Lines • Show All 825 Lines • ▼ Show 20 Lines

ORE->emit([&]() {

OuterLoop->getHeader())

<< "Only outer loops with induction or reduction PHI nodes can be"

" interchanged currently.";

});

return true;

}

Inductions.clear();

if (!findInductionAndReductions(InnerLoop, Inductions, nullptr)) {

// For multi-level loop nests, make sure that all phi nodes for inner loops

// at all levels can be recognized as a induction or reduction phi. Bail out

// if a phi node at a certain nesting level cannot be properly recognized.

Loop *CurLevelLoop = OuterLoop;

MeinersburUnsubmitted

Not Done

// if a phi node at a certain nesting level cannot be properly recognized.

- auto *CurLevelLoop = OuterLoop;

+ Loop *CurLevelLoop = OuterLoop;

while (!CurLevelLoop->getSubLoops().empty()) {

[nit] LLVM does not use Almost-Always-Auto,

Meinersbur: [nit] LLVM does not use Almost-Always-Auto,

while (!CurLevelLoop->getSubLoops().empty()) {

// We already made sure that the loop nest is tightly nested.

CurLevelLoop = CurLevelLoop->getSubLoops().front();

if (!findInductionAndReductions(CurLevelLoop, Inductions, nullptr)) {

LLVM_DEBUG(

dbgs() << "Only inner loops with induction or reduction PHI nodes "

<< "are supported currently.\n");

ORE->emit([&]() {

return OptimizationRemarkMissed(DEBUG_TYPE, "UnsupportedPHIInner",

InnerLoop->getStartLoc(),

CurLevelLoop->getStartLoc(),

InnerLoop->getHeader())

CurLevelLoop->getHeader())

<< "Only inner loops with induction or reduction PHI nodes can be"

" interchange currently.";

});

return true;

}

// TODO: Triangular loops are not handled for now.

if (!isLoopStructureUnderstood()) {

LLVM_DEBUG(dbgs() << "Loop structure not understood by pass\n");

ORE->emit([&]() {

return OptimizationRemarkMissed(DEBUG_TYPE, "UnsupportedStructureInner",

InnerLoop->getStartLoc(),

InnerLoop->getHeader())

▲ Show 20 Lines • Show All 899 Lines • Show Last 20 Lines

llvm/test/Transforms/LoopInterchange/multilevel-partial-reduction.ll

This file was added.

; RUN: opt < %s -loop-interchange -cache-line-size=4 -pass-remarks-missed='loop-interchange' -pass-remarks-output=%t -S \

; RUN: -verify-dom-info -verify-loop-info -verify-loop-lcssa

MeinersburUnsubmitted

Not Done

- ; RUN: opt < %s -basic-aa -loop-interchange -cache-line-size=64 -pass-remarks-missed='loop-interchange' -pass-remarks-output=%t -S \

+ ; RUN: opt < %s -loop-interchange -cache-line-size=64 -pass-remarks-missed='loop-interchange' -pass-remarks-output=%t -S \

; RUN: -verify-dom-info -verify-loop-info -verify-loop-lcssa

The the pass -basic-aa doesn't seem to exist anymore, might be a recent change.

Meinersbur: The the pass `-basic-aa` doesn't seem to exist anymore, might be a recent change.

; RUN: FileCheck --input-file=%t --check-prefix=REMARKS %s

@b = external global [512 x [4 x i32]]

@c = global [2 x [4 x i32]] zeroinitializer, align 1

; Check that the outermost and the middle loops are not interchanged since

; the innermost loop has a reduction operation which is however not in a form

; that loop interchange can handle. Interchanging the outermost and the

; middle loops would intervene with the reduction and cause miscompile.

; REMARKS: --- !Missed

; REMARKS-NEXT: Pass: loop-interchange

; REMARKS-NEXT: Name: UnsupportedPHIInner

; REMARKS-NEXT: Function: test7

; REMARKS: --- !Missed

; REMARKS-NEXT: Pass: loop-interchange

; REMARKS-NEXT: Name: UnsupportedPHIInner

; REMARKS-NEXT: Function: test7

define i32 @test7() {

entry:

br label %for.cond1.preheader.i

for.cond1.preheader.i: ; preds = %for.inc19.i, %entry

%i.011.i = phi i16 [ 0, %entry ], [ %inc20.i, %for.inc19.i ]

br label %for.cond4.preheader.i

for.cond4.preheader.i: ; preds = %middle.block, %for.cond1.preheader.i

%j.010.i = phi i16 [ 0, %for.cond1.preheader.i ], [ %inc17.i, %middle.block ]

%arrayidx14.i = getelementptr inbounds [2 x [4 x i32]], ptr @c, i16 0, i16 %i.011.i, i16 %j.010.i

%arrayidx14.promoted.i = load i32, ptr %arrayidx14.i, align 1

%0 = insertelement <4 x i32> <i32 poison, i32 0, i32 0, i32 0>, i32 %arrayidx14.promoted.i, i64 0

br label %vector.body

vector.body: ; preds = %vector.body, %for.cond4.preheader.i

%index = phi i16 [ 0, %for.cond4.preheader.i ], [ %index.next, %vector.body ]

%vec.phi = phi <4 x i32> [ %0, %for.cond4.preheader.i ], [ %16, %vector.body ]

%1 = or i16 %index, 1

%2 = or i16 %index, 2

%3 = or i16 %index, 3

%4 = getelementptr inbounds [512 x [4 x i32]], ptr @b, i16 0, i16 %index, i16 %j.010.i

%5 = getelementptr inbounds [512 x [4 x i32]], ptr @b, i16 0, i16 %1, i16 %j.010.i

%6 = getelementptr inbounds [512 x [4 x i32]], ptr @b, i16 0, i16 %2, i16 %j.010.i

%7 = getelementptr inbounds [512 x [4 x i32]], ptr @b, i16 0, i16 %3, i16 %j.010.i

%8 = load i32, ptr %4, align 1

%9 = load i32, ptr %5, align 1

%10 = load i32, ptr %6, align 1

%11 = load i32, ptr %7, align 1

%12 = insertelement <4 x i32> poison, i32 %8, i64 0

%13 = insertelement <4 x i32> %12, i32 %9, i64 1

%14 = insertelement <4 x i32> %13, i32 %10, i64 2

%15 = insertelement <4 x i32> %14, i32 %11, i64 3

%16 = add <4 x i32> %15, %vec.phi

%index.next = add nuw i16 %index, 4

%17 = icmp eq i16 %index.next, 512

br i1 %17, label %middle.block, label %vector.body

middle.block: ; preds = %vector.body

%18 = tail call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> %16)

store i32 %18, ptr %arrayidx14.i, align 1

%inc17.i = add nuw nsw i16 %j.010.i, 1

%exitcond12.not.i = icmp eq i16 %inc17.i, 4

br i1 %exitcond12.not.i, label %for.inc19.i, label %for.cond4.preheader.i

for.inc19.i: ; preds = %middle.block

%inc20.i = add nuw nsw i16 %i.011.i, 1

%exitcond13.not.i = icmp eq i16 %inc20.i, 2

br i1 %exitcond13.not.i, label %test.exit, label %for.cond1.preheader.i

test.exit: ; preds = %for.inc19.i

%19 = load i32, ptr @c, align 1

ret i32 %19

}

declare i32 @llvm.vector.reduce.add.v4i32(<4 x i32>)

This is an archive of the discontinued LLVM Phabricator instance.

[LoopInterchange] Do not interchange when a reduction phi in all subloops of the outer loop is not recognizableClosedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 473118

llvm/lib/Transforms/Scalar/LoopInterchange.cpp

llvm/test/Transforms/LoopInterchange/multilevel-partial-reduction.ll

[LoopInterchange] Do not interchange when a reduction phi in all subloops of the outer loop is not recognizable
ClosedPublic