[GVN] Prevent LoadPRE from hoisting across instructions that don't pass control flow to successors
Needs Review · Public

Authored by mkazantsev on Tue, Sep 5, 3:08 AM.

Details

Summary

This patch fixes a miscompile that happens when PRE hoists loads across guards and
other instructions that don't always pass control flow to their successors. PRE is now prohibited
from hoisting across such instructions, because there is no guarantee that a load that is valid after
such an instruction is also valid before it. For example, a load from under a guard may be
invalid before the guard in the following case:

int array[LEN];
...
guard(0 <= index && index < LEN);
use(array[index]);
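For illustration, here is a sketch of the transform this patch forbids, in the same pseudocode (the temporary tmp is hypothetical):

// Before PRE (safe): the load runs only if the guard passes.
guard(0 <= index && index < LEN);
use(array[index]);

// After an incorrect LoadPRE hoist: the access runs before the bounds
// check, so it may read out-of-bounds memory.
int tmp = array[index];   // hoisted above the guard -- miscompile
guard(0 <= index && index < LEN);
use(tmp);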

Diff Detail

mkazantsev created this revision.Tue, Sep 5, 3:08 AM
mkazantsev updated this revision to Diff 113820.Tue, Sep 5, 3:11 AM

Fixed formatting.

dberlin requested changes to this revision.Tue, Sep 5, 6:42 AM
  1. This seems wrong for assumes (I don't know enough about guards).

It is not incorrect to hoist through an assume, nor should doing so generate wrong code.
Assumes are essentially metadata, and documented as such.

Can you provide an example where the code generated is incorrect?

(Guards I don't know enough about to say anything.)

  2. You've done this in an N^2 way when it should be amortized-constant time (using OrderedInstructions).
This revision now requires changes to proceed.Tue, Sep 5, 6:42 AM
mkazantsev added a comment.EditedTue, Sep 5, 8:09 PM

Hi Daniel,

  1. The motivating example is the following situation:

int array[LEN];

void accessArray(int index) {
guard(0 <= index && index < LEN);
do_something(array[index]);
}

Here we guard on the condition that the array access is within bounds; otherwise we reject the compiled code and deoptimize. Hoisting the array access above the guard means that we try to access the array without knowing whether the index is within bounds. I will add a corresponding test.

  2. It seems that you're right about assumes; I will remove them.
  3. OK, I will try to do it in a faster manner.
mkazantsev updated this revision to Diff 113971.Wed, Sep 6, 2:25 AM
mkazantsev retitled this revision from [GVN] Prevent LoadPRE from hoisting through guards and assumes to [GVN] Prevent LoadPRE from hoisting through guards.
mkazantsev edited the summary of this revision. (Show Details)
  1. Removed assumes-related stuff.
  2. Added a motivating test and comment.
  3. Daniel, I didn't get what you mean by using OrderedInstructions for this. OrderedInstructions can only answer whether a particular guard dominates our load. It will still take linear time to gather the guards, and for those we already know they dominate the load, so we basically don't need to ask OrderedInstructions anything. I used any_of to speed things up, but I don't see a way to reduce the complexity.

Could you please explain what you mean by using OrderedInstructions to gain amortized-constant complexity?

dberlin added a comment.EditedWed, Sep 6, 1:36 PM

The problem:
Right now, for each load, you go through the load's block (LoadBB) and see if you hit a guard.
This information does not change for each load. At most it changes as GVN removes things (and even then, it doesn't).
The number of guards is smaller than the number of blocks.

Thus, you could trivially use OrderedInstructions to make this O(guards) * O(number of loads) instead of O(number of instructions in each load's block) * O(number of loads).
But you can do even better, and this is the constant time I refer to.
To make it constant time to check, and O(guards) to compute:

For each guard in the program:
   // Make FirstGuard point to the first guard in each block.
   auto Iter = FirstGuard.find(guard->getParent());
   if (Iter == FirstGuard.end())
     FirstGuard.insert({guard->getParent(), guard});
   else if (OrderedInstructions->dominates(Iter->second, guard))
     Iter->second = guard;

In processLoad, replace your entire loop with:

if the load's block doesn't exist in FirstGuard
    you are fine
if it does, check whether OrderedInstructions->dominates(FirstGuard[BB], load)

Basically, this is: build a table of where the first guard in each block is. Since you are trying to see whether there is a guard above a given instruction in its block, the only thing that matters is where the first guard in that block is, because you will eventually hit that guard if you walk up. OrderedInstructions will tell you whether that first guard is before or after the load in the block.

You already walk all the instructions in the main GVN loop in RPO order. You can fill in FirstGuard as you go, since the walk sees all the guards before anything one could try to lift above them, except in the case of loops.
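Put together, a minimal C++ sketch of this scheme (hypothetical names; it assumes the OrderedInstructions utility as it existed at the time, with the dominates operand order from the correction further below):

#include "llvm/ADT/DenseMap.h"
#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Instruction.h"
#include "llvm/Transforms/Utils/OrderedInstructions.h"
using namespace llvm;

// Maps each block to the first guard it contains; O(guards) to build.
static DenseMap<const BasicBlock *, const Instruction *> FirstGuard;

// Called once per guard while walking the function in RPO.
static void recordGuard(const Instruction *Guard, OrderedInstructions &OI) {
  auto It = FirstGuard.find(Guard->getParent());
  if (It == FirstGuard.end())
    FirstGuard[Guard->getParent()] = Guard;
  else if (OI.dominates(Guard, It->second))
    It->second = Guard; // This guard appears earlier in the block.
}

// Constant-time query in processLoad: is some guard above Load in its block?
static bool guardAbove(const Instruction *Load, OrderedInstructions &OI) {
  auto It = FirstGuard.find(Load->getParent());
  return It != FirstGuard.end() && OI.dominates(It->second, Load);
}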

mkazantsev planned changes to this revision.Wed, Sep 6, 10:45 PM

Thanks for the detailed clarification! I will do that.

Of course. Just in case you try to use the code, I'm pretty sure if (OrderedInstructions->dominates(Iter->second, guard))
should be if (OrderedInstructions->dominates(guard, Iter->second))

(At least one of the calls is reversed)

mkazantsev updated this revision to Diff 114311.Fri, Sep 8, 1:21 AM

Actually we don't even need OrderedInstructions to do that. It's enough to maintain a set of blocks in which we have seen guards. We iterate through blocks in reverse postorder, which guarantees that if a block has a single predecessor, that predecessor is visited before the block. We also traverse each block's instructions in program order. Using this, we just mark a block as having a guard when we meet one, and then we know for sure that:

  1. if the current block is marked, there is a guard above the current instruction, and we cannot hoist;
  2. if the only predecessor (recursively) is marked, it also blocks us from hoisting.

I've updated the patch using this solution. Please let me know if it makes sense to you.
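A minimal sketch of that idea (hypothetical names; the actual patch differs in details):

#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Instruction.h"
using namespace llvm;

// Blocks in which a guard has been seen so far. Blocks are visited in RPO and
// instructions in program order, so when we reach an instruction, its own
// block is in the set iff a guard occurs above the instruction, and any chain
// of single predecessors has already been fully processed.
static SmallPtrSet<const BasicBlock *, 8> BlocksWithGuards;

static bool isBlockedByGuard(const Instruction *I) {
  for (const BasicBlock *BB = I->getParent(); BB;
       BB = BB->getSinglePredecessor())
    if (BlocksWithGuards.count(BB))
      return true; // A guard above us blocks hoisting.
  return false;
}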


Sure, this will work, but it requires clearing the guard set between iterations :)
Given how much faster either solution is than the original option, seems fine to me. I doubt it will ever be a real performance issue.

The expected size of the set of blocks with guards is actually small, so clearing it should be cheap. Could you please approve the patch for merge?

mkazantsev planned changes to this revision.Fri, Sep 8, 1:54 AM

Need to clear the guard set between iterations, not just once at the end.

mkazantsev updated this revision to Diff 114316.Fri, Sep 8, 1:57 AM

Cleaning between iterations now.

efriedma added inline comments.
lib/Transforms/Scalar/GVN.cpp
2050

Why does this specifically apply to guards, as opposed to anything where isGuaranteedToTransferExecutionToSuccessor() is false?
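For reference, a minimal sketch of the more general check being suggested; isGuaranteedToTransferExecutionToSuccessor() is the real ValueTracking API, while the wrapper name is hypothetical:

#include "llvm/Analysis/ValueTracking.h"
#include "llvm/IR/Instruction.h"

// True if I may throw, deopt, or otherwise fail to pass control to its
// successor, so that hoisting a load from below I to above I could execute
// the load on paths where it never ran before.
static bool hasImplicitControlFlow(const llvm::Instruction *I) {
  return !llvm::isGuaranteedToTransferExecutionToSuccessor(I);
}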

dberlin added inline comments.Fri, Sep 8, 12:55 PM
lib/Transforms/Scalar/GVN.cpp
2050

Yeah, it looks like GVN presumes that if it's straight-line code with no infinite loops, all instructions have the same "guaranteed to execute" properties.

As you correctly point out, since LLVM has implicit control flow, this is not always true.
I suspect GVN's PRE is quite broken here in that respect in theory.

However, most of the effect of isGuaranteed* is on memory instructions, and those almost certainly get corrected by memdep (i.e., it ends up respecting the same things that isGuaranteed* is respecting).
Going through the rest:

if (const auto *CRI = dyn_cast<CleanupReturnInst>(I))
  return !CRI->unwindsToCaller();
if (const auto *CatchSwitch = dyn_cast<CatchSwitchInst>(I))
  return !CatchSwitch->unwindsToCaller();
if (isa<ResumeInst>(I))
  return false;
if (isa<ReturnInst>(I))
  return false;
if (isa<UnreachableInst>(I))
  return false;

None of these will have successors, so there shouldn't be a predecessor we hit to check anyway.
You can hoist above them, just not sink below them.

So those should be fine.

The callsite checking, except for throwing, is probably also handled by memdep's insertion of memory dependencies.

So I suspect you may be able to rig up a testcase where memdep and isGuaranteed* end up disagreeing in a way that causes this to break. But I also suspect it's hard.

Thoughts?

(And using isGuaranteed* would handle guards, since guard is marked as throwing.)

> So I suspect you may be able to rig up a testcase where memdep and isGuaranteed* end up disagreeing in a way that causes this to break.

See https://reviews.llvm.org/D21041. :)

mkazantsev added inline comments.Mon, Sep 11, 12:10 AM
lib/Transforms/Scalar/GVN.cpp
2050

Hi Eli,

I tried to use isGuaranteedToTransferExecutionToSuccessor, and it works for my example, but another test starts failing. The test is test7 from Transforms/GVN/PRE/volatile.ll, which is:

; Does cross block PRE work with volatiles?
define i32 @test7(i1 %c, i32* noalias nocapture %p, i32* noalias nocapture %q) {
; CHECK-LABEL: test7
; CHECK-LABEL: entry.header_crit_edge:
; CHECK:       %y.pre = load i32, i32* %p
; CHECK-LABEL: skip:
; CHECK:       %y1 = load i32, i32* %p
; CHECK-LABEL: header:
; CHECK:      %y = phi i32
; CHECK-NEXT: %x = load volatile i32, i32* %q
; CHECK-NEXT: %add = sub i32 %y, %x
entry:
  br i1 %c, label %header, label %skip
skip:
  %y1 = load i32, i32* %p
  call void @use(i32 %y1)
  br label %header
header:
  %x = load volatile i32, i32* %q
  %y = load i32, i32* %p
  %add = sub i32 %y, %x
  %cnd = icmp eq i32 %add, 0
  br i1 %cnd, label %exit, label %header
exit:
  ret i32 %add
}

It now fails because this function returns false for volatile loads and stores, since they are allowed to trap. This logic was added in https://reviews.llvm.org/D21167, and Sanjoy's comment there asking why we do that remained unanswered. Could you please clarify why it was added, and is it OK if this test now fails?

Thanks,
Max

We consider volatile loads and stores as potentially trapping as a courtesy to our users. Strictly speaking, it's undefined behavior in C, but people sometimes intentionally trigger SIGSEGV using volatile loads, so we don't want to cause an unrelated crash before that happens. Breaking that testcase should be fine.

Eli, I presume an updated version with isGuaranteed... should fix your testcases from D21041, right?
If so, Max, can you add them to this patch?

I think it will fix those testcases... at least, it will fix the load testcase (I don't think this patch addresses the divide-by-zero problem).

> We consider volatile loads and stores as potentially trapping as a courtesy to our users. Strictly speaking, it's undefined behavior in C, but people sometimes intentionally trigger SIGSEGV using volatile loads, so we don't want to cause an unrelated crash before that happens. Breaking that testcase should be fine.

I disagree. This would be breaking optimizations *across* a volatile load, not breaking optimizations *of* a volatile load. I think the isGuaranteedTo... scheme is probably the right one, but I think we got our semantics for that function wrong for volatile loads.

Given we've not documented volatile loads as potentially trapping, I don't think we should *partially* implement that semantic. I'd actually like to have a faulting load concept in the IR, but hacking that in for volatile seems misguided.

For the moment, I'll suggest a compromise to unblock this review from the broader discussion. Can we use the isGuaranteedTo... scheme, but additionally whitelist skipping over volatile loads? That lets us make progress on this correctness issue while taking the volatile semantics discussion elsewhere.
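A minimal sketch of that compromise (the helper name is hypothetical; volatile loads/stores are whitelisted explicitly, and everything else defers to the ValueTracking query):

#include "llvm/Analysis/ValueTracking.h"
#include "llvm/IR/Instructions.h"
using namespace llvm;

// Like isGuaranteedToTransferExecutionToSuccessor(), but whitelists volatile
// loads/stores so PRE may still hoist across them while the volatile
// semantics discussion continues elsewhere.
static bool transfersExecutionForPRE(const Instruction *I) {
  if (const auto *LI = dyn_cast<LoadInst>(I))
    if (LI->isVolatile())
      return true;
  if (const auto *SI = dyn_cast<StoreInst>(I))
    if (SI->isVolatile())
      return true;
  return isGuaranteedToTransferExecutionToSuccessor(I);
}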

lib/Transforms/Scalar/GVN.cpp
2051

I'm not sure building this set in RPO is sufficient in the face of cycles. I believe the current PRE code would bail out of said cycles, but at a minimum, that assumption needs to be carefully documented and asserted for.

I.e., how do we know that BlocksWithGuards has been populated for each block we traverse backwards through?

In particular, if we extend this to isGuaranteedTo... and the scalar PRE case from https://reviews.llvm.org/D21041, it's not obvious to me this code would be correct.

The volatile load thing never really got discussed in depth, as far as I remember... I wanted to be conservative because volatile memory access can be weird, but I guess my interpretation in isGuaranteedToTransferExecutionToSuccessor() is a little suspect given the way alias analysis treats volatile operations.

> The volatile load thing never really got discussed in depth, as far as I remember... I wanted to be conservative because volatile memory access can be weird, but I guess my interpretation in isGuaranteedToTransferExecutionToSuccessor() is a little suspect given the way alias analysis treats volatile operations.

I think it would be good to revisit this. As you say, we'd also need to model this in AA (by essentially making all volatile accesses universally aliasing, because the trap handlers could be arbitrary functions). Moreover, it would essentially mean modeling volatile loads as both reading and writing memory.

What I consider to be "regular" uses of volatile, handling memory-mapped I/O and communicating with signal/interrupt handlers, I don't think require this "loads also write" treatment. These use cases might want universally-aliasing stores (because the stores might trigger some device or signal/interrupt handler to do something arbitrary), but can allow more-precise AA for volatile loads.

I don't know exactly what use cases we want to support with volatile accesses. If it's essentially just memory-mapped I/O (for device drivers and the like), then it might make sense to model them that way. If we have other use cases, then it might not make sense (and, as Philip says, we might want some other mechanism).

If we do want to make volatile accesses universally aliasing and writing, then I think we should remove them from the IR altogether and add some intrinsics for this use case instead.

dberlin added inline comments.Mon, Sep 11, 7:53 PM
lib/Transforms/Scalar/GVN.cpp
2051

It's definitely not sufficient in the case of cycles.

The only reason it's thought to be sufficient here is that you are guaranteed that any straight line predecessors have already been processed.

Since it's only walking back through straight line predecessors, ...

So yes, if it tried to use this info in random predecessors, it would not be correct the way it is being computed here.

mkazantsev added inline comments.Mon, Sep 11, 10:02 PM
lib/Transforms/Scalar/GVN.cpp
2051

Yes, we can only rely on this set if we have visited all the predecessors before. This is true for the case where we use it (because there is only one predecessor) and not true in general. I'll write a comment to make that clear.

mkazantsev edited the summary of this revision. (Show Details)

Reused isGuaranteedToTransferExecutionToSuccessor and added some clarifying comments. In this version of the patch, we explicitly allow hoisting across volatile loads/stores. Given the discussion here, we might want to modify isGuaranteedToTransferExecutionToSuccessor so that it returns true for volatile loads/stores. Once the ongoing discussion on volatiles comes to some conclusion, this part can be reworked.

Could you add the testcases from D21041?

mkazantsev retitled this revision from [GVN] Prevent LoadPRE from hoisting through guards to [GVN] Prevent LoadPRE from hoisting across instructions that don't pass control flow to successors.
mkazantsev edited the summary of this revision. (Show Details)

Added some tests, changed commit message.

dberlin added inline comments.Tue, Sep 12, 10:08 PM
lib/Transforms/Scalar/GVN.cpp
2076

It's worth noting:

The RPO iteration order only guarantees that checking straight-line predecessors is safe. [1]
It would not be safe to check other predecessors unless the program is loop-free
(i.e., such a thing would have to be computed separately from this loop).

[1] I can prove this if you like: a visit-order DFS walk of the dominator tree is a valid RPO order [2]. Any straight-line single predecessor must be your idom, because there is no path around it to your node. Because the dominator tree is acyclic and yields a valid RPO order, RPO must guarantee that node is visited before yours.
The only way something else could be visited first is if there were another path to your node (which would in turn prove that your idom is not really your idom).

[2] LLVM does not happen to keep it in that sort order, but you can see that, for example, NewGVN sorts it back into the "right" order.

mkazantsev marked an inline comment as done.

Factored the filling of the implicit control flow map out of the main iteration loop.

dberlin added inline comments.Wed, Sep 13, 7:39 AM
lib/Transforms/Scalar/GVN.cpp
2365

This is expensive to compute, you should pass it in.

2379

for (auto &I : BB)

(the &*BI will just be &I below)

2382

This map could be const BasicBlock * -> const Instruction *, FWIW.

mkazantsev marked 3 inline comments as done.

Nits.

efriedma added inline comments.Wed, Sep 13, 11:25 AM
test/Transforms/GVN/PRE/local-pre.ll
50

GVN got smarter, so this isn't actually testing what it's supposed to test anymore. Maybe try replacing "icmp eq i32 %p, %q" with "icmp eq i32 %p, %r", where "%r" is a parameter. And I guess pass %a to may_exit(), so it won't get moved if we come up with some new form of sinking in the future.

mkazantsev added inline comments.Wed, Sep 13, 8:03 PM
test/Transforms/GVN/PRE/local-pre.ll
50

What is the difference between icmp eq i32 %p, %q, where %q is a parameter, and icmp eq i32 %p, %r, where %r is a parameter? If you mean that we should branch on an unknown condition, then we could just add a parameter i5 %cond.

Anyway, I don't think that extending the test set should block us from merging this functional fix. I also don't see the conceptual difference between this test and what you describe: the transformation across may_exit still doesn't happen. Do you mind if we consider adding the new tests as an NFC change later?

mkazantsev added inline comments.Wed, Sep 13, 8:04 PM
test/Transforms/GVN/PRE/local-pre.ll
50

i1 %cond, of course :)

efriedma added inline comments.Thu, Sep 14, 12:30 PM
test/Transforms/GVN/PRE/local-pre.ll
50

i1 %cond probably does the same thing, yes. The important part is that we don't simplify sdiv i32 %p, %q to 1.

Fixing the underlying issue here doesn't necessarily block committing this patch, I guess... but it's probably something you want to address at some point, and it's very closely related to this patch.

mkazantsev added inline comments.Fri, Sep 15, 4:08 AM
test/Transforms/GVN/PRE/local-pre.ll
50

I want to clearly distinguish between fixing the miscompile with guards and the missed opportunity to turn this div into 1. Do we have a bug filed for this missed opportunity? You could file one if you think this case is important. But I don't think it is part of the problem being addressed by this patch.

efriedma added inline comments.Fri, Sep 15, 11:06 AM
test/Transforms/GVN/PRE/local-pre.ll
50

Sorry, probably should have written this out in the first place.

GVN currently performs an incorrect optimization for the following testcase:

define i32 @test2(i32 %p, i32 %q, i1 %r) {
block1:
  br i1 %r, label %block2, label %block3

block2:
  %a = sdiv i32 %p, %q
  br label %block4

block3:
  br label %block4

block4:
  %phi = phi i32 [ 0, %block3 ], [ %a, %block2 ]
  call void @may_exit(i32 %phi) nounwind
  %b = sdiv i32 %p, %q
  ret i32 %b
}
declare void @may_exit(i32) nounwind

(The problem: PRE inserts the sdiv into block3, before the call to may_exit(); if may_exit() never returns and %q is zero, the transformed program divides by zero where the original never would.)
mkazantsev marked an inline comment as done.Fri, Sep 15, 8:17 PM
mkazantsev added inline comments.
test/Transforms/GVN/PRE/local-pre.ll
50

Ah, now I get what you mean. :) Thanks for pointing it out! Fixed.

mkazantsev marked an inline comment as done.

Fixed one more case with PRE, updated tests accordingly.

dberlin added inline comments.Fri, Sep 15, 9:07 PM
lib/Transforms/Scalar/GVN.cpp
2170

Please use OrderedInstructions, which caches this, and invalidate it as needed.