Page MenuHomePhabricator

[GVN] Improve PRE on load instructions
ClosedPublic

Authored by Carrot on Jan 13 2023, 11:12 AM.

Details

Summary

This patch implements the enhancement proposed by https://github.com/llvm/llvm-project/issues/59312.

Suppose we have following code

   v0 = load %addr
   br %LoadBB

LoadBB:
   v1 = load %addr
   ...

PredBB:
   ...
   br %cond, label %LoadBB, label %SuccBB

SuccBB:
   v2 = load %addr
   ...

Instruction v1 in LoadBB is partially redundant, edge (PredBB, LoadBB) is a critical edge. SuccBB is another successor of PredBB, it contains another load v2 which is identical to v1. Current GVN splits the critical edge (PredBB, LoadBB) and inserts a new load in it. A better method is move the load of v2 into PredBB, then v1 can be changed to a PHI instruction.

If there are two or more similar predecessors, like the test case in the bug entry, current GVN simply gives up because otherwise it needs to split multiple critical edges. But we can move all loads in successor blocks into predecessors.

This is the second try of D139582, with the following enhancement.

  • Don't try to move load instructions across exception handling instructions.
  • In function replaceValuesPerBlockEntry ValuesPerBlock[BB] may not be the moved loaded value, in this case we should not replace its value.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Carrot added inline comments.Jan 19 2023, 2:32 PM
llvm/lib/Transforms/Scalar/GVN.cpp
2692

Sounds reasonable.

So now I just create a new load, replace all uses of old load with the new load. The dead old load instruction can be deleted in the next iteration of GVN.

mkazantsev accepted this revision.Jan 22 2023, 8:26 PM

Looks good now, thanks for your efforts!

This revision is now accepted and ready to land.Jan 22 2023, 8:26 PM
This revision was landed with ongoing or failed builds.Jan 25 2023, 11:48 AM
This revision was automatically updated to reflect the committed changes.
nikic added a comment.Jan 26 2023, 3:14 AM

It looks like the new version of the patch ended up being more expensive in terms of compile-time: Original, New

I wonder whether this is because not removing the load essentially always forces an extra GVN iteration?

llvm/lib/Transforms/Scalar/GVN.cpp
1478

If we're not removing the load, we probably shouldn't be removing it from the leader table either?

It looks like the new version of the patch ended up being more expensive in terms of compile-time: Original, New

I wonder whether this is because not removing the load essentially always forces an extra GVN iteration?

I think so, it causes all triggering cases need another iteration.

llvm/lib/Transforms/Scalar/GVN.cpp
1478

Because we expect it to be deleted in the next iteration, and the same value is also available in the NewLoad instruction, so I think it should not be used by other optimizations, and assume it's not available in the leader table.

Hello!

It seems that with this patch GVN may yield different results when debug info is present.
I suspect the MaxNumInsnsPerBlock limit includes also debug instructions?
This is seen with the following example:

opt -passes=gvn bbi-78272.ll -S -o -

If you remove some dbg.declare from the input you get different output.

dstenb added a subscriber: dstenb.Jan 27 2023, 8:06 AM

@uabelho, thank you for the reporting! I think you are right. I'm preparing the patch.

hans added a subscriber: hans.Feb 1 2023, 2:47 AM

Just a heads up that we've bisected a test failure to this revision: https://bugs.chromium.org/p/chromium/issues/detail?id=1411693
We're still investigating, but it would be interesting to know if others have also hit any issues.

hans added a subscriber: reames.Feb 1 2023, 9:30 AM

I'm not sure it explains our test failure, but @reames raised a concern on the llvm-commits email:

I believe this patch to be incorrect, and need reverted.

The case I'm concerned about is that this patch appears to hoist a load 
over instructions which may throw.  The load at the original location 
may be dynamically dead because the exception path is always taken.  
Moving said load into the executed path can introduce a new fault in the 
program.

@hans, do you have a reduced test case to demonstrate the problem?

I exchanged emails with @reames

A throwing call is a terminator, can we find a load below it ?

On Mon, Jan 30, 2023 at 11:49 AM Philip Reames


<listmail@philipreames.com> wrote:
>
> You continue on any non-identical instruction.  That instruction can be
> a throwing call.
>
> Philip
>
> On 1/30/23 11:23, Carrot Wei wrote:
> > Hi Philip
> >
> > Do you have a reduced test case to show your problem?
> >
> > This patch tries to check exception throwing instructions in the
> > function findLoadToHoistIntoPred. What's your case that I missed?
> >
> > thanks
> > Guozhi Wei

At this point, please revert the patch immediately.

There's been a probable bug reported.  There's been a failure reported. 
These are probably the same.  You should revert, investigate, and fix
offline.

Philip

Carrot added a comment.Feb 1 2023, 1:27 PM

@reames, I'm happy to revert it once I get your test case, otherwise I have nothing to investigate.

slightly modified from one of the test cases:

declare i1 @foo()
declare void @maybethrow() readnone

; %v3 is partially redundant, bb3 has multiple predecessors coming through
; critical edges. The other successors of those predecessors have same loads.
; We can move all loads into predecessors.

define void @test17(ptr %p1, ptr %p2, ptr %p3, ptr %p4) {
entry:
  %v1 = load i64, ptr %p1, align 8
  %cond1 = icmp sgt i64 %v1, 200
  br i1 %cond1, label %bb200, label %bb1

bb1:
  %cond2 = icmp sgt i64 %v1, 100
  br i1 %cond2, label %bb100, label %bb2

bb2:
  %v2 = add nsw i64 %v1, 1
  store i64 %v2, ptr %p1, align 8
  br label %bb3

bb3:
  %v3 = load i64, ptr %p1, align 8
  store i64 %v3, ptr %p2, align 8
  ret void

bb100:
  %cond3 = call i1 @foo()
  br i1 %cond3, label %bb3, label %bb101

bb101:
  %v4 = load i64, ptr %p1, align 8
  store i64 %v4, ptr %p3, align 8
  ret void

bb200:
  %cond4 = call i1 @foo()
  br i1 %cond4, label %bb3, label %bb201

bb201:
  %_ = call i1 @maybethrow()
  %v5 = load i64, ptr %p1, align 8
  store i64 %v5, ptr %p4, align 8
  ret void
}

becomes

declare i1 @foo()

; Function Attrs: memory(none)
declare void @maybethrow() #0

define void @test17(ptr %p1, ptr %p2, ptr %p3, ptr %p4) {
entry:
  %v1 = load i64, ptr %p1, align 8
  %cond1 = icmp sgt i64 %v1, 200
  br i1 %cond1, label %bb200, label %bb1

bb1:                                              ; preds = %entry
  %cond2 = icmp sgt i64 %v1, 100
  br i1 %cond2, label %bb100, label %bb2

bb2:                                              ; preds = %bb1
  %v2 = add nsw i64 %v1, 1
  store i64 %v2, ptr %p1, align 8
  br label %bb3

bb3:                                              ; preds = %bb200, %bb100, %bb2
  %v3 = phi i64 [ %v3.pre, %bb200 ], [ %v3.pre1, %bb100 ], [ %v2, %bb2 ]
  store i64 %v3, ptr %p2, align 8
  ret void

bb100:                                            ; preds = %bb1
  %cond3 = call i1 @foo()
  %v3.pre1 = load i64, ptr %p1, align 8
  br i1 %cond3, label %bb3, label %bb101

bb101:                                            ; preds = %bb100
  store i64 %v3.pre1, ptr %p3, align 8
  ret void

bb200:                                            ; preds = %entry
  %cond4 = call i1 @foo()
  %v3.pre = load i64, ptr %p1, align 8
  br i1 %cond4, label %bb3, label %bb201

bb201:                                            ; preds = %bb200
  %_ = call i1 @maybethrow()
  store i64 %v3.pre, ptr %p4, align 8
  ret void
}

attributes #0 = { memory(none) }

%v5 is hoisted above call @maybethrow()

Carrot added a comment.Feb 1 2023, 2:50 PM

It's reverted now. Thanks for the test case. Will improve it in another version.

Carrot reopened this revision.Feb 2 2023, 8:08 PM
This revision is now accepted and ready to land.Feb 2 2023, 8:08 PM
Carrot updated this revision to Diff 494499.Feb 2 2023, 8:10 PM

Check for implicit control flow instruction in function findLoadToHoistIntoPred, so we can avoid moving load across unexpected control flow.

@reames and @hans please try to see if this patch works for your code.

reames added a comment.Feb 3 2023, 7:54 AM

This fixes (via the call to isDominatedByICFIFromSame) the issue I had identified in post commit review. I have not closely examined the patch more generally.

hans added a comment.Feb 3 2023, 12:02 PM

Check for implicit control flow instruction in function findLoadToHoistIntoPred, so we can avoid moving load across unexpected control flow.

@reames and @hans please try to see if this patch works for your code.

Sadly, our test fails also with this version.

We don't have a reduced repro.

Would it be possible to try this out on some code internally in case any test there exposes the problem?

In private email communication, @hans told me he couldn't find any problem in his optimized code, and his fail disappeared after updating his source code. So there is no known problem with this version.

Could anybody take another look at this patch? Compared to last version there is only one line change in function findLoadToHoistIntoPred.

chapuni accepted this revision.Feb 23 2023, 11:46 PM

Works for me.

I believe the updated patch doesn't address the compile-time regression yet -- it's still running a whole extra GVN iteration to remove the load that is left behind.

Carrot updated this revision to Diff 500326.Feb 24 2023, 4:23 PM

To improve compile time, now I record the basic blocks contain dead load instructions, at the end of the same iteration, process these blocks again to delete load instructions. So we can avoid the extra iteration on the whole function.

@nikic, the original version of this patch is 0.07% slower because more triggered optimizations caused more iterations on huge functions.
http://llvm-compile-time-tracker.com/compare.php?from=d38d6065584ee5dd837e9a629f90c731d8a7dffc&to=94fc2022ff32b63d6c744d3eeff4f304e1a81618&stat=instructions:u

The second version is 0.16% slower because it leaves the dead instructions to next iteration, so it causes an extra iteration on whole function. http://llvm-compile-time-tracker.com/compare.php?from=287508cd9c4396c8845d92310d258879202a179e&to=5f1448fe1585b5677d5f0064e4eeac3b493d8a18&stat=instructions%3Au

The current version is 0.08% slower, very close to the first version. This time I record the BBs contain dead stores, instead of revisit the whole function, I just revisit these recorded BBs. http://llvm-compile-time-tracker.com/compare.php?from=c8e5354f0063b090b6fa7ad4cfc2b9df78038454&to=d21dbfa65423bbd9251ab5948f6cc1baa2db61c8&stat=instructions:u.

Is it OK now?
Thanks.

Carrot added a comment.Apr 3 2023, 2:30 PM

@nikic, do you have any other concern on this patch?

@mkazantsev, could you help to take a look at the latest version?

@nikic expressed his concern on compile time, then I sent out this version to improve compile time. But @nikic didn't response for several weeks.

The change is the use of a new variable AdditionalWorkSet for tracking BBs contain dead load, so we can process these BBs from AdditionalWorkSet at the end of each iteration, and avoid an extra iteration on the whole function.

@nikic, the original version of this patch is 0.07% slower because more triggered optimizations caused more iterations on huge functions.
http://llvm-compile-time-tracker.com/compare.php?from=d38d6065584ee5dd837e9a629f90c731d8a7dffc&to=94fc2022ff32b63d6c744d3eeff4f304e1a81618&stat=instructions:u

The second version is 0.16% slower because it leaves the dead instructions to next iteration, so it causes an extra iteration on whole function. http://llvm-compile-time-tracker.com/compare.php?from=287508cd9c4396c8845d92310d258879202a179e&to=5f1448fe1585b5677d5f0064e4eeac3b493d8a18&stat=instructions%3Au

The current version is 0.08% slower, very close to the first version. This time I record the BBs contain dead stores, instead of revisit the whole function, I just revisit these recorded BBs. http://llvm-compile-time-tracker.com/compare.php?from=c8e5354f0063b090b6fa7ad4cfc2b9df78038454&to=d21dbfa65423bbd9251ab5948f6cc1baa2db61c8&stat=instructions:u.

Is it OK now?
Thanks.

0.08% compile time difference is too small to be considered significant. To double check, what is the clang self build time diff?

nikic added a comment.Apr 27 2023, 1:44 PM

Sorry for the long silence. I think it's fine in terms of compile time, but I guess I still don't fully get why it is better to reprocess the block than remove the load. I understand that InstrsToErase doesn't work for this case because it's only for the current BB, but we basically have the same code copied at the end of performScalarPRE() as well, so it seems like it should be possible to extract that into a common helper and use it here as well. Maybe I'm missing some subtlety.

Carrot updated this revision to Diff 518049.Apr 28 2023, 2:09 PM

Extract the common code for deleting an instruction to a function removeInstruction, call it from several places as required.

nikic requested changes to this revision.May 9 2023, 6:24 AM

I believe the change to metadata.ll is a miscompile. The new range metadata in that case should be a union of both old ranges. This is missing a combineMetadataForCSE() somewhere.

This revision now requires changes to proceed.May 9 2023, 6:24 AM
Carrot updated this revision to Diff 520874.Tue, May 9, 5:45 PM

Combine the old load's metadata before deleting it.

nikic accepted this revision.Sun, May 14, 7:45 AM

LGTM (Haven't reviewed the whole thing, but my concerns have been addressed.)

This revision is now accepted and ready to land.Sun, May 14, 7:45 AM
This revision was landed with ongoing or failed builds.Tue, May 16, 6:12 PM
This revision was automatically updated to reflect the committed changes.
TimNN added a subscriber: TimNN.Tue, May 16, 10:14 PM

I strongly suspect that this patch is breaking the Rust compiler when building against LLVM head:

Instruction does not dominate all uses!
  %11 = load i32, ptr %10, align 4, !range !10, !alias.scope !6314, !noalias !6319
  %19 = phi i32 [ %8, %9 ], [ %11, %32 ]
in function _RINvNtCsbJsLLO9YF7d_9rustc_hir10intravisit9walk_bodyINtNtCs7c017R9yI2R_10rustc_lint4late18LateContextAndPassNtBT_33BuiltinCombinedModuleLateLintPassEEBT_
LLVM ERROR: Broken function found, compilation aborted!
error: could not compile `rustc_lint` (lib)

From https://buildkite.com/llvm-project/rust-llvm-integrate-prototype/builds/19373

Changes since the last successful run of that build bot: https://github.com/llvm/llvm-project/compare/d3b9d8b28f8e9fde41a4136c28f458347d9bb292...6a2f52a3a00bdfc8bd4edf6592099a7a749d324e

(I don't have a minimized repro yet).

@TimNN, I have reverted it, LLVM buildbot also reported a sanitized bootstrap failure.

If you have simple repro, it will be a great help to me.

Thanks.

TimNN added a comment.Wed, May 17, 9:16 AM

@TimNN, I have reverted it, LLVM buildbot also reported a sanitized bootstrap failure.

If you have simple repro, it will be a great help to me.

Thanks for reverting! I sadly don't have a good repro, and likely won't have the time to try and get one in the near future.

The compiler crash didn't occur in my last commit, but occurred in this commit. It is caused by the interaction between the new instruction deletion method and the implementation of replaceValuesPerBlockEntry.

Consider the following case

; case1.

bb1:
   ...
   ; NewLoad is added here.
   br %cond1, label %bb2, label %bb3

bb2:
   OldLoad
   ...
   br label %bb3

bb3:
   Load
   ...

Once we decide to do the transformation, we take the following steps

  • Insert NewLoad into bb1.
  • Replace ValuesPerBlock(OldLoad->getParent(), OldLoad) with ValuesPerBlock(OldLoad->getParent(), NewLoad), so we'll get ValuesPerBlock(%bb2, NewLoad).
  • Delete instruction OldLoad, and replace all its uses with NewLoad.
  • Replace instruction Load with a new PHI instruction, required information is from ValuesPerBlock.

If we change the case to

; case2.

bb1:
   ...
   ; NewLoad is added here.
   br %cond, label %bb2, label %bb3

bb2:
   OldLoad
   ...
   br %cond2, label %bb5, label %bb4

bb4:
   ...
   br label %bb3

bb3:
   Load
   ...

We take the following steps to do the transformation

  • Insert NewLoad into bb1.
  • Try to replace ValuesPerBlock(OldLoad->getParent(), OldLoad) with ValuesPerBlock(OldLoad->getParent(), NewLoad). But we don't have an entry for ValuesPerBlock(%bb2, OldLoad), instead we have an entry ValuesPerBlock(%bb4, OldLoad), and it is not changed.
  • Delete instruction OldLoad, and replace all its uses with NewLoad.
  • Replace instruction Load with a new PHI instruction, required information is from ValuesPerBlock. Then we get the value OldLoad from %bb4 again, because OldLoad is already deleted, it's an invalid value, and causes crash.

In my last version, OldLoad isn't deleted immediately, the reference to it in the PHI instruction is valid and correct. In the next iteration of GVN, OldLoad can be found as fully redundant, then it is deleted and the use in PHI instruction is updated to use NewLoad.

It should be fixed in replaceValuesPerBlockEntry, every entry with the value OldLoad should be changed to NewLoad. Because OldLoad is dominated by NewLoad, so this change is safe.

Carrot reopened this revision.Fri, May 26, 1:22 PM
This revision is now accepted and ready to land.Fri, May 26, 1:22 PM
Carrot updated this revision to Diff 526181.Fri, May 26, 1:29 PM

Updated the function replaceValuesPerBlockEntry to replace all OldLoad with NewLoad. So the deleted OldLoad instruction will not be used by later PHI instruction.

@TimNN, @antmo, @nickdesaulniers, @maxim-kuvyrkov could you help to test if this version work for you?

thanks

antmo added a comment.Tue, May 30, 8:33 AM

the new version looks ok here (no more crash on ByteCode.cpp)

Updated the function replaceValuesPerBlockEntry to replace all OldLoad with NewLoad. So the deleted OldLoad instruction will not be used by later PHI instruction.

@TimNN, @antmo, @nickdesaulniers, @maxim-kuvyrkov could you help to test if this version work for you?

thanks

Sorry, before even being able to test the new patch, I'm running into numerous regressions that I need to sort out first:

  1. https://github.com/ClangBuiltLinux/linux/issues/1855
  2. https://github.com/ClangBuiltLinux/linux/issues/1856
  3. https://github.com/ClangBuiltLinux/linux/issues/1857

It's unusual to have that many regressions over the weekend, but may we please have some time to resolve those so that we can re-test this patch properly (before you resubmit)?

@antmo, thank you for your verification.

@nickdesaulniers, no problem to me, thanks!

nickdesaulniers accepted this revision.Wed, May 31, 1:59 PM

Thanks for your patience. I was able to verify that Diff 526181 no longer ICE's as was observed by @maxim-kuvyrkov in linux stable 6.3.y for ARCH=arm allyesconfig https://ci.linaro.org/job/tcwg_kernel--llvm-master-arm-stable-allyesconfig-build/29/artifact/artifacts/06-build_linux/console.log.xz.

TimNN added a comment.Fri, Jun 2, 8:15 AM

I could _not_ reproduce the problems I previously saw when building the Rust compiler, so seems to be all good now :).

Carrot added a comment.Fri, Jun 2, 3:52 PM

@nickdesaulniers, @TimNN, thank you for your verification.

@mkazantsev, @nikic, could you take a look at this version, the only modification is in function replaceValuesPerBlockEntry and a new test case.

nikic accepted this revision.Mon, Jun 5, 12:06 AM

LGTM

This revision was landed with ongoing or failed builds.Tue, Jun 6, 12:47 PM
This revision was automatically updated to reflect the committed changes.