This is an archive of the discontinued LLVM Phabricator instance.

[LoopDeletion] Separate logic in breakBackedgeIfNotTaken using symboic max trip count [nfc]
ClosedPublic

Authored by reames on Aug 27 2021, 2:40 PM.

Details

Summary

As mentioned in D108833, the logic for figuring out if a backedge is dead was somewhat interwoven with the SCEV based logic and the symbolic eval logic. This is my attempt at making the code easier to follow.

Diff Detail

Event Timeline

reames created this revision.Aug 27 2021, 2:40 PM
reames requested review of this revision.Aug 27 2021, 2:40 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 27 2021, 2:40 PM
nikic added a comment.Aug 28 2021, 2:02 AM

This looks reasonable to me, but doesn't seem to be NFC in practice. Your previous change in D108833 resulted in a minor code size decrease in CMakeFiles/7zip-benchmark.dir/CPP/7zip/UI/Common/ArchiveCommandLine.cpp.o and CMakeFiles/7zip-benchmark.dir/CPP/7zip/UI/Console/Main.cpp.o at O3, and it looks like this change restores the previous size. I strongly suspect that this is just another SCEV invalidation artifact though.

mkazantsev accepted this revision.Aug 29 2021, 9:55 PM

LGTM, but I'm also not sure if it's an NFC.

This revision is now accepted and ready to land.Aug 29 2021, 9:55 PM
reames added a comment.EditedAug 30 2021, 8:31 AM

This looks reasonable to me, but doesn't seem to be NFC in practice. Your previous change in D108833 resulted in a minor code size decrease in CMakeFiles/7zip-benchmark.dir/CPP/7zip/UI/Common/ArchiveCommandLine.cpp.o and CMakeFiles/7zip-benchmark.dir/CPP/7zip/UI/Console/Main.cpp.o at O3, and it looks like this change restores the previous size. I strongly suspect that this is just another SCEV invalidation artifact though.

My best guess here is that we have a case where the trip count logic is able to compute a constant maximum of zero, but for which the exact symbolic trip count is not zero. I am uncomfortable landing this with the regression noted because I don't know how broadly this triggers.

Any chance you're setup to easily capture the IR (even just the whole module) for one of these? I'd prefer not to have to get the test-suite setup again, it tends to be pretty time consuming.

nikic added a comment.Aug 30 2021, 9:38 AM

This looks reasonable to me, but doesn't seem to be NFC in practice. Your previous change in D108833 resulted in a minor code size decrease in CMakeFiles/7zip-benchmark.dir/CPP/7zip/UI/Common/ArchiveCommandLine.cpp.o and CMakeFiles/7zip-benchmark.dir/CPP/7zip/UI/Console/Main.cpp.o at O3, and it looks like this change restores the previous size. I strongly suspect that this is just another SCEV invalidation artifact though.

My best guess here is that we have a case where the trip count logic is able to compute a constant maximum of zero, but for which the exact symbolic trip count is not zero. I am uncomfortable landing this with the regression noted because I don't know how broadly this triggers.

Any chance you're setup to easily capture the IR (even just the whole module) for one of these? I'd prefer not to have to get the test-suite setup again, it tends to be pretty time consuming.

Here's the unoptimized IR for both files: https://gist.github.com/nikic/3df8cd975e6a59d287dc00687be1ce27 opt -O3 reproduces the difference.

Here's the unoptimized IR for both files: https://gist.github.com/nikic/3df8cd975e6a59d287dc00687be1ce27 opt -O3 reproduces the difference.

Thanks. It turns out my guess didn't cover the cases seen in practice, bugpointing and reducing now.

Digging into the ArchiveCommandLine.ll case, the symbolic expression we compute (instead of the constant max zero) is (-1 + (zext i32 (1 umax (1 smin %29)) to i64))<nsw>. This comes from the howFarToZero codepath. I tried bugpointing this, and the reduced example seems to depend on overflow logic again, but the reduced example is different enough that I'm not 100% sure the unreduced case is the same. It could simply be a missing combine. Either way, we end up with another case where max == 0, and exact = <some symbolic possibly zero value>.

Given this seems to be a broader pattern, I tried tweaking the constructor of ExitLimit to set the exact trip count to zero when the max trip count is zero. Oddly, that causes regression in some Thumb2 code placement tests. I have no idea why just yet.

My proposal is to land D108921 (once reviewed), then understand the Thumb regression, fix that, and then sink the zero check into ExitLimit, then land this patch.

D109015 should address the code size differences found by Nikita. Once that has landed, I will also land this one as it's already been LGTMed.

Took a quick look at this. No obvious relation, I'd guess some type of pass ordering and/or analysis invalidation. If you can reduce me something which changes behavior with this pass, I'll take another look.

Took a quick look at this. No obvious relation, I'd guess some type of pass ordering and/or analysis invalidation. If you can reduce me something which changes behavior with this pass, I'll take another look.

Looks like another potential code change has been tracked down to this commit.

For the reproducer below, loop-deletion replaces the branch in for.inc8 with an unconditional one with the patch reverted:

> for.inc8:                                         ; preds = %for.inc8.loopexit, %for.cond4thread-pre-split
31c40
<   br i1 %exitcond.not, label %cleanup, label %for.body, !llvm.loop !11
---
>   br label %cleanup
@a = internal unnamed_addr global i1 false, align 4
@b = internal unnamed_addr global i32 0, align 4

define i32 @main() {
entry:
  br label %for.body

for.body:                                         ; preds = %entry, %for.inc8
  %d.015 = phi i32 [ 0, %entry ], [ %inc9, %for.inc8 ]
  %.b = load i1, i1* @a, align 4
  %conv1 = select i1 %.b, i32 -128, i32 1
  %cmp2.not = icmp ugt i32 %conv1, %d.015
  br i1 %cmp2.not, label %cleanup, label %for.cond4thread-pre-split

for.cond4thread-pre-split:                        ; preds = %for.body
  %.pr = load i32, i32* @b, align 4
  %cmp514 = icmp slt i32 %.pr, 1
  br i1 %cmp514, label %for.body7, label %for.inc8

for.body7:                                        ; preds = %for.cond4thread-pre-split, %for.body7
  tail call void @foo() #2
  %0 = load i32, i32* @b, align 4
  %inc = add nsw i32 %0, 1
  store i32 %inc, i32* @b, align 4
  %cmp5 = icmp slt i32 %0, 0
  br i1 %cmp5, label %for.body7, label %for.inc8, !llvm.loop !9

for.inc8:                                         ; preds = %for.body7, %for.cond4thread-pre-split
  %inc9 = add nuw nsw i32 %d.015, 1
  %exitcond.not = icmp eq i32 %inc9, 19
  br i1 %exitcond.not, label %cleanup, label %for.body, !llvm.loop !11

cleanup:                                          ; preds = %for.body, %for.inc8
  store i1 true, i1* @a, align 4
  ret i32 0
}

declare void @foo() local_unnamed_addr #1

!9 = distinct !{!9, !10}
!10 = !{!"llvm.loop.mustprogress"}
!11 = distinct !{!11, !10}

I still don't see anything obvious, but am going to take a deeper look. If I don't find a cause for the just-reported regression, I'll revert the original change.

Well, it turns out the answer is that I'm a bloody idiot. The code is definitely wrong, not sure why I didn't notice this before.

The issue is that we use isKnownNonZero to filter the canProveOnFirstIteration call. This filtering depends on a *lower bound* of the actual trip count. My change passed an *upper bound*. That results in us failing to call canProveOnFirstIteration for cases we can prove a loop exit is taken.

Working on a fix now. May end up just reverting depending on whether the fix is reasonable clean on the newer code or not.

Change reverted. I added a test case to demonstrate the issue with this patch so that we don't make the same mistake in the future.

In retrospect, I clearly should have reverted this earlier. My apologies.