This is an archive of the discontinued LLVM Phabricator instance.

[LICM] Allow promotion of some stores that are not guaranteed to execute
ClosedPublic

Authored by mkuper on Dec 28 2016, 4:24 PM.

Details

Summary

Promotion is always legal when a store within the loop is guaranteed to execute.

However, this is not a necessary condition - for promotion to be memory model semantics-preserving, it is enough to have a store that dominates every exit block. This is because if the store dominates every exit block, the fact the exit block was executed implies the original store was executed as well.

Diff Detail

Event Timeline

mkuper updated this revision to Diff 82627.Dec 28 2016, 4:24 PM
mkuper retitled this revision from to [LICM] Allow promotion of some stores that are not guaranteed to execute.
mkuper updated this object.
mkuper added reviewers: reames, eli.friedman, danielcdh.
mkuper added a subscriber: llvm-commits.
reames requested changes to this revision.Dec 28 2016, 9:02 PM
reames edited edge metadata.

I'm responding to the description. If the implementation splits a hair the description doesn't, my response may be off base.

I don't think this transform is legal. The exception being thrown might be what prevents the otherwise faulting load or store from executing.

Consider this small example:
for(.i = ...) {

throw_if_out_of_bounds(i);
a[i] += 5;

}

We can't speculatively execute the load from a[i] as it might be the case that it crosses a page boundary and dereferencing the memory will introduce a page fault which doesn't exist in the original program.

(I'm deliberately using a loop variant i in the example, but the same thing applies if i is loop invariant.)

lib/Transforms/Scalar/LICM.cpp
911–912

You don't need a store. A load is sufficient for this part.

This revision now requires changes to proceed.Dec 28 2016, 9:02 PM
reames added inline comments.Dec 28 2016, 9:06 PM
test/Transforms/LICM/scalar_promote.ll
209

Both of your examples have unconditional loads with conditional stores. This is interesting in that the dereferenceability would seem to be established by the load. Is this representative of the real example you're trying to address?

Thanks, Philip!

Yes, sorry, that was unclear from the description. I think it's clearer in the inline comment.

For promotion to be legal you need two conditions to hold:

  1. You can unconditionally dereference the pointer in a hoisted load. This holds is if at least one store or load is guaranteed to execute - but that's not a necessary condition. You may know the pointer is dereferencible through other means.
  2. You are allowed to sink the store, from the memory model point of view. The previous code would check that the store is either guaranteed to execute or, if not, stores to a thread local.

What I'm doing here is relaxing condition 2.
Assuming I got the logic right, this patch should not be introducing any cases where the load can fault. That's why my tests have an unconditional load.

This targets reduction loops like

for (int i = 0; i < n; ++i)
  for (int j = 0; j < m; ++j)
    local[i] += foo(i,j)

Where local[i] would otherwise be promotable within the loop, except that the store is not guaranteed to execute because foo() may throw. Note that the load precedes the call to bar.

mkuper updated this revision to Diff 82677.Dec 29 2016, 10:15 AM
mkuper edited edge metadata.

Added more tests that hopefully alleviate Philip's concerns.

reames requested changes to this revision.Dec 29 2016, 12:36 PM
reames edited edge metadata.

Ok, with the inline discussion, it looks like this is a valid and useful extension. Can you update the description of the patch to reflect the discussion?

Inline code comments below.

lib/Transforms/Scalar/LICM.cpp
253–259

You've got a scattering of unrelated trivial cleanup changes here (style fixes, clarifying comments on the original code, etc..). Please separate those and submit them. No further review needed, but I'd like to get them out of this patch for future rounds.

258

Changing from the or and exit scheme to the and scheme should be NFC. Please separate and land without further review. (In particular, if this turns out *not* to be NFC, I want to be able to bisect to the refactoring.)

298

You switching from a lazy to an eager computation of both exit blocks and insert pts. In this case, it shouldn't matter since hasDedicatedExits also does the same (relatively expensive) computation of the loop exits.

At some point, we really do need to cache the loop exit walk.... (not for this change)

301

Use of the Changed variable here is overly conservative. We actually only need to run this if *store promotion* changed the loop. If you wanted to do that as a a separate change, it would be a useful cleanup and compile time win.

958

This loop is getting really complicated and hard to follow. I think we need to find an alternate structure here which makes your changes easier to reason about.

One idea would be to introduce two variables for the two different things we're trying to establish. (i.e. LocationIsDereferencable, and CanInsertStoresOnExits.) The key simplification is that we don't need to track *why* we know we can insert the stores.

I'm also open to other organizations here.

983–984

You still need a slightly modified version of this comment.

1039–1040

I'm having a hard time following the semantic changes here intermixed with the refactoring. Please separate that and commit without further review. I suspect that once that's done, I'll spot a couple more things here.

test/Transforms/LICM/scalar_promote.ll
208

Looking at this test, it really should be handled by the thread local code we already have. Have you looked to see why it isn't?

In fact, looking at *all* of your tests, the thread local case should handle it...?

(Note: This is a very important question to that I'll want an answer to before we commit the semantics parts of this code.)

This revision now requires changes to proceed.Dec 29 2016, 12:36 PM
mkuper added inline comments.Dec 29 2016, 2:20 PM
lib/Transforms/Scalar/LICM.cpp
253–259

I'm not sure all of those changes makes sense while ExitBlocks are still computed lazily. I'll separate as many of them as reasonable, let me know if you still want things unbundled at the next iteration.

258

So, it's not NFC on its own, because it also requries the sinking the formLCSSARecursively() call out of the if.

Note that:

  1. If Changed was true, then "Preheader || L->hasDedicatedExits()" is true, because the arguments to the or guard the previous transofrmations.
  2. The "if (Changed)" on line 269 can be true even if promoteLoopAccessesToScalars() made no changes, since it's the same Changed variable, not one defined within the if block.

So we end up running formLCSSARecursively() for *any* change. Which works, by accident, but is really counter-intuitive. And would breaks if DisablePromotion actually happend to be true.

The two changes should be NFC when combined, though, so I'll commit that separately.

298

It's not just that - isGuaranteedToExecute() also does the same computation...
But yes, that's a separate change.

301

I think that's actually wrong (and the comment is misleading, I should fix it)- see above for explanation.

I mean, I'm not saying that we *need* to run it in other cases (although I'm pretty sure we do), I'm saying we *did* run it in other cases, and I'd prefer this part to be NFC.

911–912

No, if the load is guaranteed to be executed, but the store is under control flow, then it's not safe to promote, unless you have additional information (next paragraph).

I can try to rewrite this whole thing, but I'm not sure I can make it both precise and clearer.

958

I agree. I'll see if I can simplify.

test/Transforms/LICM/scalar_promote.ll
208

Why would it be handled by the thread-local code?
%local is an alloca, so it's not thread-local memory.

mkuper added inline comments.Dec 29 2016, 2:42 PM
lib/Transforms/Scalar/LICM.cpp
301

And looks like I was wrong, apparently this really is fine to run only for promotions, I guess the other transformations preserve LCSSA.
I'll sort all of that out first, then update this patch.

Responding to the thread local discussion, but leaving the rest until you've separated the various pieces and landed them.

test/Transforms/LICM/scalar_promote.ll
208

An alloca is thread local until the first time it escapes and become visible to another thread. This is the same as a call to malloc().

I think the isAllocLikeFn check in the thread local case should become (isAllocLikeFn || isa<AllocaInst>). I believe that will handle the cases you've mentioned.

mkuper added inline comments.Dec 29 2016, 4:17 PM
test/Transforms/LICM/scalar_promote.ll
208

An alloca is thread local until the first time it escapes and become visible to another thread. This is the same as a call to malloc().

Argh, right.

I think the isAllocLikeFn check in the thread local case should become (isAllocLikeFn || isa<AllocaInst>). I believe that will handle the cases you've mentioned.

I guess you're right, that will solve most issues. Thanks for the (failed :-) ) sanity check.
I think we may want both things, though - this can fire even if the alloca is captured (or if it's a global or a function argument instead of the alloca), if bar is argmemonly/readnone.

reames added inline comments.Dec 29 2016, 4:35 PM
test/Transforms/LICM/scalar_promote.ll
208

Completely agreed that we want both. Multiple ways to skin a cat are always welcome; sometimes one falls over where another doesn't.

Why don't you spawn another thread for the thread local change and save this one for the current approach? I see that you've split out most of the pieces, so once we rebase this should be much easier to review. I'd really like to see both pieces land in the next day or so.

So, the patch is now fairly small, but I'm having trouble with the testcases.
In particular, I think what we have now is unsound, because isGuaranteedToTransferExecutionToSuccessor() returns true for argmemonly calls, which means that things that should only get promoted with my patch get promoted without.

I'll get back to this on Tuesday. For now, I'll upload the code change, without fixing the tests, in case you want to take a look at the semantic change.

mkuper updated this revision to Diff 82714.Dec 29 2016, 6:03 PM
mkuper edited edge metadata.

So, the patch is now fairly small, but I'm having trouble with the testcases.
In particular, I think what we have now is unsound, because isGuaranteedToTransferExecutionToSuccessor() returns true for argmemonly calls, which means that things that should only get promoted with my patch get promoted without.

We have a prevaling assumption that to throw an exception, you must write to some global (but unknown) location to do so. We effectively assume that any readonly call is also nounwind. As you point out, we also assume that argmemonly must be readonly with respect to that unknown magic location. That doesn't seem sound as the location written to might be reachable through an argument of the function. So, yes, I think this is wrong.

I'll get back to this on Tuesday. For now, I'll upload the code change, without fixing the tests, in case you want to take a look at the semantic change.

Thanks. I took a look and have only minor comments on what's left. Once we have the above issue fixed, this will be quick.

lib/Transforms/Scalar/LICM.cpp
928

if an exception is thrown *on the first iteration*.

1012

Expand this comment a bit. possibly: because it must have executed in the original program before reaching one of the inert points.

sanjoy added a subscriber: sanjoy.Dec 31 2016, 2:25 PM

So, the patch is now fairly small, but I'm having trouble with the testcases.
In particular, I think what we have now is unsound, because isGuaranteedToTransferExecutionToSuccessor() returns true for argmemonly calls, which means that things that should only get promoted with my patch get promoted without.

That was an oversight, fixed in https://reviews.llvm.org/rL290794

I'll get back to this on Tuesday. For now, I'll upload the code change, without fixing the tests, in case you want to take a look at the semantic change.

mkuper updated this revision to Diff 82919.Jan 3 2017, 11:38 AM
mkuper updated this object.
mkuper edited edge metadata.

After r290794, it looks like the tests are actually checking what they should.

efriedma added inline comments.
test/Transforms/LICM/scalar_promote.ll
222

I'm pretty sure this transform is wrong. For example, @capture could create a mutex protecting %local, and lock it. Then @opaque could release the mutex, does some work, and lock the mutex. In that case, the value stored by store i32 %x2, i32* %local is visible to other threads in each iteration, so you can't sink the store out of the loop.

efriedma requested changes to this revision.Jan 3 2017, 12:27 PM
efriedma added a reviewer: efriedma.
This revision now requires changes to proceed.Jan 3 2017, 12:27 PM
efriedma added inline comments.Jan 3 2017, 12:32 PM
test/Transforms/LICM/scalar_promote.ll
222

Err, my example is overly complicated. You don't need multiple threads. The address of %local escapes, therefore @opaque could read it, therefore we have to update it every iteration of the loop.

efriedma resigned from this revision.Jan 3 2017, 12:34 PM
efriedma removed a reviewer: efriedma.
efriedma added inline comments.
test/Transforms/LICM/scalar_promote.ll
222

Err, sorry, ignore me, didn't notice the argmemonly annotation on @opaque. I need to think a bit.

mkuper added inline comments.Jan 3 2017, 12:57 PM
test/Transforms/LICM/scalar_promote.ll
222

I'm not actually changing this behavior.
The transformation is already performed when @opaque is marked as nounwind. This patch only removes the nounwind requirement, which has no bearing on your scenario.

In any case, I don't think the stores are required to be observable in this scenario (and you also need to pass the mutex as an argument, but, technically, that's possible), so I believe the transformation is still legal. If we reach the conclusion it's not, then it will still be legal if opaque is "argmemonly readonly", since readonly functions are not allowed to perform atomic writes. So we'll need to start requiring readonly (and not just argmemonly), but that's orthogonal to this patch.

efriedma accepted this revision.Jan 3 2017, 2:11 PM
efriedma added a reviewer: efriedma.

I'm satisfied with this, with some minor tweaks.

lib/Transforms/Scalar/LICM.cpp
1019

This check is similar to llvm::isGuaranteedToExecute. Maybe worth noting the parallel here.

If the loop throws, you're depending on the earlier check that the memory location is an alloca; you should explicitly note that here.

test/Transforms/LICM/scalar_promote.ll
222

Mmm... okay, I think I understand what's happening.

From a memory model perspective, the rule is this: if the current thread performs a non-atomic store to %local, no other thread is allowed to read that memory until the current thread performs an atomic release (or equivalent). Therefore, if a loop doesn't contain an atomic release, and every exit is dominated by a non-atomic store, you can sink the store out of the loop. (An atomic release gets modeled by alias analysis as a read to all memory locations whose addresses have escaped, so the existence of the alias set implies the loop doesn't contain a relevant atomic release.)

The wrinkle introduced by a call which throws is that there are additional exits which are not modeled in the CFG; the store might not actually dominate all exits. That said, if a loop throws, we only allow sinking stores to allocas, and allocas get deallocated when the loop throws, so we're still okay.

mkuper added a comment.Jan 4 2017, 2:38 PM

Thanks Eli.

Philip, do you have any more comments?

lib/Transforms/Scalar/LICM.cpp
1019

I'm not depending on it being an alloca. I'm only depending on it being dereferenceable - which is the DereferenceableInPH check and is documented. So I'm not sure how this can be more explicit.

efriedma added inline comments.Jan 4 2017, 3:19 PM
lib/Transforms/Scalar/LICM.cpp
1019

The implicit unwind edges from the loop aren't listed in ExitBlocks, so you aren't checking them. If the memory is an alloca, this doesn't matter because the unwind edge will end the lifetime of the alloca; if we somehow extended LICM so it could make implicit unwind edges explicit, the missing check might matter.

mkuper added inline comments.Jan 4 2017, 3:23 PM
lib/Transforms/Scalar/LICM.cpp
1019

Oh, got it, you mean the check on line 951. Right, will add a comment.

No further comments, LGTM

This revision was automatically updated to reflect the committed changes.