This is an archive of the discontinued LLVM Phabricator instance.

[LoopInterchange][PR57148] Ensure LCSSA form after loop interchnange
ClosedPublic

Authored by congzhe on Aug 17 2022, 11:34 AM.

Details

Summary

This patch resolves pr57148 (https://github.com/llvm/llvm-project/issues/57148) which is an assertion error due to of loss of LCSSA form after interchange.

In cases where the LCSSA form is not maintained after interchange (e.g., interchanging the middle loop and the outermost loop in the test case added in this patch), we change the IR to LCSSA form again.

The test case added is the reproducer of pr57148.

Diff Detail

Event Timeline

congzhe created this revision.Aug 17 2022, 11:34 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 17 2022, 11:34 AM
Herald added a subscriber: hiraditya. · View Herald Transcript
congzhe requested review of this revision.Aug 17 2022, 11:34 AM

I have no idea if this is the right fix, but I've verified that it solves the problem that I reported in pr 57148.
Thanks!

However, that was a reduced version of the problem, and if I try the original non-reduced version (unfortunately for an out-of-tree target) I get another failure with this patch.
So it might be a step in the right direction, but there are more problems lurking.

A reduced version of the new problem is

opt -passes="loop-interchange" bbi-72571_2.ll -o /dev/null

which results in

Instruction does not dominate all uses!
  %i.166.lcssa = phi i16 [ %3, %vector.body85.split ]
  %arrayidx60 = getelementptr inbounds [2 x [4 x i32]], ptr @c, i16 0, i16 %i.166.lcssa, i16 %j.165
LLVM ERROR: Broken module found, compilation aborted!

bmahjour added inline comments.Aug 22 2022, 9:53 AM
llvm/lib/Transforms/Scalar/LoopInterchange.cpp
575–576

can outer loop still be non-lcssa?

576

why not consider innerloop also?

congzhe updated this revision to Diff 461456.EditedSep 19 2022, 7:54 PM

I have no idea if this is the right fix, but I've verified that it solves the problem that I reported in pr 57148.
Thanks!

However, that was a reduced version of the problem, and if I try the original non-reduced version (unfortunately for an out-of-tree target) I get another failure with this patch.
So it might be a step in the right direction, but there are more problems lurking.

A reduced version of the new problem is

opt -passes="loop-interchange" bbi-72571_2.ll -o /dev/null

which results in

Instruction does not dominate all uses!
  %i.166.lcssa = phi i16 [ %3, %vector.body85.split ]
  %arrayidx60 = getelementptr inbounds [2 x [4 x i32]], ptr @c, i16 0, i16 %i.166.lcssa, i16 %j.165
LLVM ERROR: Broken module found, compilation aborted!

The problem in bbi-72751_2.ll is different from the original bbi-72751.ll. The problem and solution is described as follows.

The problem is that if we have a loop nest like this:

for.outermost.header:
  %phi.outermost = ...

for.middle.header:
  %phi.middle= ...
  use of %phi.outermost

for.innermost:
  ....

If we interchange the outermost and the middle loop, it would become:

// the new outermost loop header
for.middle.header:
  %phi.middle= ...
  use of %phi.outermost

// the new middle loop header
for.outermost.header:
  %phi.outermost = ...

for.innermost:
  ....

And we'll end up with the problem that we have use of %phi.outermost before its def.

The solution is that we split the inner loop header to two basic blocks where one contains only phi instructions. We already did this splitting for the innermost loop header which usually is also the innermost loop latch (so it is a single BB that contains phis and other instructions). With this patch we do the splitting for non-innermost loops as well.

For the case above this is how we do the splitting after this patch:

for.outermost.header:
  %phi.outermost = ...

for.middle.header:
  %phi.middle= ...

for.middle.header.split:
  use of %phi.outermost

for.innermost:
  ....

And now we can correctly interchange them:

// the new outermost loop header
for.middle.header:
  %phi.middle= ...

// the new middle loop header
for.outermost.header:
  %phi.outermost = ...

for.middle.header.split:
  use of %phi.outermost

for.innermost:
  ....

bbi-72571_2.ll is added in llvm/test/Transforms/LoopInterchange/pr57148.ll as another test case. Another change in test cases is llvm/test/Transforms/LoopInterchange/pr43176-move-to-new-latch.ll but it is just simple and better clean up of BBs, there is no functional change.

congzhe added inline comments.Sep 19 2022, 7:56 PM
llvm/lib/Transforms/Scalar/LoopInterchange.cpp
576

Thanks for the comment! I've now used formLCSSARecursively() and removed the LCSSA assertion check for both inner and outer loops.

This revision is now accepted and ready to land.Sep 21 2022, 8:18 AM
Meinersbur accepted this revision.Sep 21 2022, 8:21 AM

LGTM

llvm/lib/Transforms/Scalar/LoopInterchange.cpp
1355–1357

[suggestion] Update commend for condition: "Ensure the inner loop phi nodes have a separate basic block."

congzhe updated this revision to Diff 462074.Sep 21 2022, 8:31 PM
congzhe added inline comments.
llvm/lib/Transforms/Scalar/LoopInterchange.cpp
1355–1357

Thanks, I've updated the patch accordingly and I will land it shortly.

This revision was landed with ongoing or failed builds.Sep 21 2022, 9:21 PM
This revision was automatically updated to reflect the committed changes.

Hello @congzhe ,

I think I'm seeing a miscompile with this patch:

opt "-passes=function(loop(loop-interchange))" bbi-74005_x86.ll -S -o -

Now I don't know how this is supposed to work but it looks to me that in the input code we read

%arrayidx14.promoted.i = load i32, ptr %arrayidx14.i, align 1

then do 512 rounds of the inner loop and add the read elements, and then we store what we have so far:

%18 = tail call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> %16)
store i32 %18, ptr %arrayidx14.i, align 1

But after loop-interchange we load, then add, but then we dont do the store, but do the load and execute the inner loop again? So it looks like we throw away the calculated addition and just read the same values again?
Or am I missing something here?

congzhe added a comment.EditedSep 26 2022, 10:18 PM

Hello @congzhe ,

I think I'm seeing a miscompile with this patch:

opt "-passes=function(loop(loop-interchange))" bbi-74005_x86.ll -S -o -

Now I don't know how this is supposed to work but it looks to me that in the input code we read

%arrayidx14.promoted.i = load i32, ptr %arrayidx14.i, align 1

then do 512 rounds of the inner loop and add the read elements, and then we store what we have so far:

%18 = tail call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> %16)
store i32 %18, ptr %arrayidx14.i, align 1

But after loop-interchange we load, then add, but then we dont do the store, but do the load and execute the inner loop again? So it looks like we throw away the calculated addition and just read the same values again?
Or am I missing something here?

Hi Mikael, thanks for finding it out, I'll take a look at this issue,

Hello @congzhe ,

I think I'm seeing a miscompile with this patch:

opt "-passes=function(loop(loop-interchange))" bbi-74005_x86.ll -S -o -

Now I don't know how this is supposed to work but it looks to me that in the input code we read

%arrayidx14.promoted.i = load i32, ptr %arrayidx14.i, align 1

then do 512 rounds of the inner loop and add the read elements, and then we store what we have so far:

%18 = tail call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> %16)
store i32 %18, ptr %arrayidx14.i, align 1

But after loop-interchange we load, then add, but then we dont do the store, but do the load and execute the inner loop again? So it looks like we throw away the calculated addition and just read the same values again?
Or am I missing something here?

Hello @congzhe ,

I think I'm seeing a miscompile with this patch:

opt "-passes=function(loop(loop-interchange))" bbi-74005_x86.ll -S -o -

Now I don't know how this is supposed to work but it looks to me that in the input code we read

%arrayidx14.promoted.i = load i32, ptr %arrayidx14.i, align 1

then do 512 rounds of the inner loop and add the read elements, and then we store what we have so far:

%18 = tail call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> %16)
store i32 %18, ptr %arrayidx14.i, align 1

But after loop-interchange we load, then add, but then we dont do the store, but do the load and execute the inner loop again? So it looks like we throw away the calculated addition and just read the same values again?
Or am I missing something here?

Hi Mikael, thanks for finding it out, I'll take a look at this issue,

Posted D134930 to fix it.