Further relax the constraint on atomics in MemoryDependencyAnalysis.cpp

Authored by morisset on Aug 11 2014, 10:46 AM.



This is the patch 3/3 resulting from the split of D4797.

Even loads/stores that have a stronger ordering than monotonic can be safe.
The rule is no release-acquire pair on the path from the QueryInst, assuming that
the QueryInst is not atomic itself.

I also cleaned up a bit the test DeadStoreElimination/atomic.ll
as I was adding tests to it

Depends on D4797
Depends on D4844

Diff Detail

morisset retitled this revision from to Further relax the constraint on atomics in MemoryDependencyAnalysis.cpp.Aug 11 2014, 10:46 AM
morisset updated this object.
morisset edited the test plan for this revision. (Show Details)
morisset added reviewers: jfb, reames.
morisset added a subscriber: Unknown Object (MLST).
morisset updated this revision to Diff 12425.Aug 12 2014, 5:08 PM

The signature of the helpers introduced in D4844 changed, so this
revision needs this fix to use the new one.

morisset updated this revision to Diff 12627.Aug 18 2014, 2:44 PM

Just a rebase since D4797 changed.

jfb added a comment.Aug 20 2014, 9:53 PM

The change looks good overall, but I'd like to have @reames review it too.


Why not also test load-acq followed by store-rel here and in other places (or the reverse store/load)? It seems like a good sanity check.


Drop the extra space.

Comments inline. I am not yet convinced of the correctness.


I'm still not finding this explanation particularly clear. I'm going to ask you not to commit this until you can justify why this approach is correct.

A potentially more intuitive way to explain this is to why this reordering is valid:
store x = 0
acquire operator
release operation
load x < -- okay to use zero

This is valid because the first pair of operations and the second pair can both be reordered. i.e. the intermediate state is:
acquire operation
store x = 0
load x < -- now obviously zero
release operation

What is "discriminating context"? What does it mean to "clobber"?


This is a bit off topic for this review, but I may have spotted a bug here. What if the query instruction is a RMW operation? These can have ordering semantics, but the existing code (from your previous change) would say there's no dependence.


I think this check is wrong. (Independent of whether the approach is valid or not.)

My reasoning is that, a cst operation is "at least" an acquire. However, it's _also_ a release and reordering these two is not legal:
load cst y
store x = 0

This would be a violation of a "release operation" ordering as used by C++11. It's unclear whether that would violate the LLVM spec as written, but I think it probably should.

Note that if my reasoning is sound, this is actually a problem with the previous patch as well, not just this one.


As currently specified in the LangRef, I don't think this is legal. If I'm reading the spec right, the load must be considered both an Acquire and a Release. (This is possibly not intended, but seems to follow from the wording.) As a result, this is not a acquire/release pair, but both a acquire/release and release/acquire pair.

Finding a case where this is observable would be hard, but it seems possible. If need be, I'll spend some time thinking about it.


I don't believe this is valid.


This is valid, but not necessarily for the reason you gave. %y can move above the release. %x can move below the acquire. Once this happens, %x & %y can be commoned.



Thank you very much for the careful reviews !

Inline comments below with answers to your comments.


The reordering approach is not enough to explain this optimisation, as there is no way of hoisting the store of x back above the acquire operation in the example you gave.

I will try to give a more detailed explanation below, if you find it clearer I will put it in the comments.
In the following code:

store x = 0
release operation (1)
acquire operation (4)
%val = load x

it is not okay to replace %val by 0, because another thread may be doing:

acquire operation (2)
store x = 42
release operation (3)

And if the program ensures that 1 synchronizes with 2, and 3 with 4 then this code is correct and %val ends up being 42 and not 0.

A key property of the above program, is that if either (1) or (4) are absent, there would be a race between the store x = 42 and either the original store or the subsequent load (so the whole program is undefined). It can be shown (mostly through an excruciatingly boring case analysis) that every such program where the optimisation is visible needs such a pair of release-acquire pair for synchronisation between the threads or is racy.

(Discriminating context means "any number of threads which would make this optimisation visible if they were running concurrently with that one". I should indeed remove it as it seems unclear).


Indeed, thank you very much for find this.. it is incredible how hard to get right these things appear to be. I will send another patch shortly fixing it.
I would suggest making isAtomic()/isSimple() methods of all instructions (just return false/true respectively for the instructions that do not override it) and just checking that. In this way There would be much less risk in the future of forgetting one such case (there is also CmpXchg for example). Does this sound reasonable ?


I am pretty sure a cst operation only behaves as a release if it stores something (and as an acquire only if it loads something).

For example, the 29.3 section starts by defining the memory orders in this way:
[..] and memory_order_seq_cst: a store operation performs a release operation on the affected memory location.
This is also true in the formalisation of Batty&al: a synchronizes-with relation can only be from a store or fence to a load or fence [http://www.cl.cam.ac.uk/~pes20/cpp/]

So for this purpose, seq_cst loads do behave mostly like acquire loads (the difference is the extra total order on all seq_cst operations, that is irrelevant for the reasoning in this patch).


In the LangRef, just under the first occurence of memory_order_seq_cst I see:
"In addition to the guarantees of acq_rel (acquire for an operation which only reads, release for an operation which only writes)"
It seems pretty clear to me that it follows the C++11 standard, and that a load cannot be considered a release, even if it is seq_cst.
If you see a case where this optimisation would be observable, I am extremely interested in it.


I agree, that was intended to be covered by the previous test, but it cannot hurt to add another one, I will do it.


For this optimisation to be invalid, @y would have to change between the two loads.
Such a store to @y would necessarily race with the first load (to %x) and make the whole program undefined.
So my argument is that this is correct because it can only be observed by racy programs.


Sure, sorry about forgetting that.


Yes, I didn't notice it.

comments inline.


I do find your explanation more clear. Putting that in a comment or documentation would be a good idea. This gives the intuition for the approach.

I'd avoid the term "discriminating context". It's extra jargon with no real gain.

FYI, when I think about the legality of the optimization in terms of reordering, I tend to think "if I did this, can I do that?" It doesn't necessarily imply that I actually need to perform the reordering. It simply means that I *could* and thus it is legal to do the optimization. Now, potentially I might get myself in trouble by assuming two conflicting reorderings, but I've never actually run into that. (yet) My experience is that if I can't find some series of valid reorderings that could enable an optimization, it usually isn't actually correct. :)


This approach seems reasonable to me.

Alternately, you could use mayReadMemory() as your generic fallback check.


You're right. I went and checked the spec and my interpretation was incorrect.


Thanks for pointing out that wording. I was looking in acq_rel. Any chance you could move that to be under acq_rel? It seems odd to have a later order defining a previous one.

For this particular example, it's only valid because we know @x and @y are distinct locations. If we didn't know that, we couldn't remove the first store without changing %x, and thus the observable value.


I had to think about this one a bit. :) I can't find an argument to refute your racy program one, but I find that answer disturbing. From a practical software engineering perspective, the answer "oh, we corrupted all of the runtime state in hard to explain ways because you had one race somewhere" is a bit hard to swallow. I will admit it's what the c++11 standard says though.

I'm curious, do you believe the answer changes if both loads are monotonic? If not, this would seem to imply that LLVM can't implement the Java memory model correctly. The JMM does specify limited semantics for racy programs.

I will update this patch with the clearer comment/documentation ASAP.

Comments inline about the other points of discussion.


mayReadOrWriteMemory seems like the perfect fallback indeed, I will send a patch using it.


It is not under the acq_rel section, because it is impossible to have an acq_rel load or store, only RMW/CmpXchg operations can be acq_rel.

For the non-aliasing condition I agree, but that is a completely unrelated check: it also applies even if y was non-atomic (and is correctly checked by MemoryDependency).


About the practical software engineering issue, it could maybe be mitigated by warnings (although I am not sure how exactly to give helpful warnings for this) and by the use of thread sanitizer. Do you think a discussion on LLVM-dev would be warranted about the topic of how aggressively/non-intuitively we may optimize code ? I admit I was focusing only on the correctness with regards to the standard and not necessarily with regards to the programmer's expectations. On the other hand, the programmer is not supposed to use atomics if he does not know what he's doing... (and he will probably have a bad time otherwise even if the compiler is not aggressively optimizing)

This specific argument does not work if the loads are monotonic (as races involving relaxed accesses are well-defined in C++11). About the question of whether it is true for another reason, I don't know.
We have a paper under review (we = Viktor Vafeiadis and Thibault Balabonski mostly) about what kind of optimizations are possible on atomics themselves (such as would be the case here if the loads are monotonic). I do not remember what it says about this specific situation (and don't have a copy of it with me right now), but we generally found that almost every optimization (even seemingly "obviously true" ones) break horrendously in horrifyingly subtle and evil ways as soon as they involve monotonic/relaxed accesses. So I plan to avoid messing with these kinds of accesses, at least until/unless their semantics is modified/fixed by the C++ committee.

Java accesses are "unordered" from what I understand. It is because I have no idea what their exact semantics is (especially in the LLVM framework based on C11) that I picked isSimple() in the previous patch (although I had forgotten them in the beginning).

morisset updated this revision to Diff 12824.Aug 21 2014, 7:07 PM

Mostly add to the big comment, and improve a bit the tests.

LGTM. I'm still not 100% comfortable with this, but I've been unable to find a counter example and arguments given are solid. I don't want to hold this up any further.

reames accepted this revision.Aug 29 2014, 11:45 AM
This revision is now accepted and ready to land.Aug 29 2014, 11:45 AM
morisset closed this revision.Aug 29 2014, 2:09 PM