PNaCl needs this so optimization/targetmachine passes don't reorder the initial load upon translation to a native target. This also reflects what the function comment specifies.
Diff Detail
Event Timeline
This isn't just for PNaCl, so the description should be updated. You should mention something along the lines of:
On weak memory systems the CPU can speculate on subsequent loads (e.g. the cmpxchg) and observe them without honoring the happens-before ordering of the corresponding stores. This is the "load buffering" problem in literature, and occurs on ARM and POWER.
lib/CodeGen/AtomicExpandPass.cpp | ||
---|---|---|
557 | I think this should be an unconditional InitLoaded->setOrdering(AtomicCmpXchgInst::getStrongestFailureOrdering(MemOpOrder)); and there shouldn't be an IsRelaxed parameter. This will give the leading load a memory order that's valid for loads, and corresponds to the ordering the RMW instruction had. |
Unconditionally make the initial load atomic and fix the resulting X86 test failures.
test/Transforms/AtomicExpand/X86/expand-atomic-rmw-initial-load.ll | ||
---|---|---|
7 ↗ | (On Diff #33517) | Could you also test the other memory orderings in separate test functions? I mainly want to ensure that an acq_rel RMW operation gets a load acquire (the others should have the same ordering). |
test/Transforms/AtomicExpand/X86/expand-atomic-rmw-initial-load.ll | ||
---|---|---|
7 ↗ | (On Diff #33629) | Also relaxed, acquire, release (which should stay as-is). |
test/Transforms/AtomicExpand/X86/expand-atomic-rmw-initial-load.ll | ||
---|---|---|
7 ↗ | (On Diff #33633) | Sorry, I didn't realize you asked for the others until after I checked Done. Also, s/relaxed/monotonic/. |
I don't completely understand your argument here. The load buffering pattern (LB) occurs where there is a load to a location, followed by a store to a different location (and I agree that in that case the store can be executed before the load on ARM/Power). Here there is a load followed by a cmpxchg to the same location. No cache-coherent processor I know of can do anything tricky in such a case (accesses to the same location are always globally ordered, and in a way compatible with the program order). The only thing that I could see occurring is for the load to return stale data, which would only lead the cmpxchg to fail, and to an extra iteration of the loop. Am I missing something (likely considering it is 1 am) ?
So I agree that it deserves a comment to explain why it is safe but I still believe the original code safe; and I would rather avoid these extra cmpxchg8b/16b your change introduces unless they are necessary.
Extra remark: if the risk is indeed a LB pattern, then x86 is immune to it anyway, which makes me doubt the utility of these tests. Although in that case I don't have any better ones to suggest, since x86 is the only backend to exercise this function.
After thinking some more about it, I realized that having the load stay non-atomic is actually unsound because there might be a race, which would make it undefined behaviour.
Sadly, making it monotonic would still introduce these redundant cmpxchg8b/16b since monotonic requires the access to be atomic.
I don't know well enough the semantics of unordered accesses to know if they would do the trick, and the llvm reference manual description of them is fairly lightweight on details.
@reames: do you know if unordered has the right strength for allowing races without forcing the backend to use cmpxchg8b/16b instead of two mov ? If no, do you have a better suggestion ?
Thanks for chiming in. Maybe @jyasskin has an opinion from an LLVM memory model point of view.
From http://llvm.org/docs/Atomics.html#unordered, "note that an unordered load or store cannot be split into multiple instructions."
You can't use a normal access because that allows a later pass to change it to:
%1 = load %ptr
%2 = op %1
%oops = load %ptr
cmpxchg %ptr, %oops, %2
Bleh. With the current instructions, I don't think there's a good answer here.
I talked to @jyasskin in person: signal fence would work but isn't quite what we want. We would want a new type of atomic that participates in ordering, but is allowed to tear. This doesn't exist, but we can emulate it by:
- Changing the pass to be parameterize with "largest efficient atomic operation for this target". This is effectively the widest is_lock_free value.
- Passing this number when initializing the target's pass list.
- Having the pass explicitly tear too-wide loads into sub-loads of the type from 1., and reassembling the value from that.
This would manually create the tearing (tests would be the same as before) but the load would be atomic.
@morisset, does that sounds good?
It sounds great to me (although a lot of work for such a corner case, but I don't see any better solution).
@jyasskin I took a look at implementing what we discussed, and it's quite a bit uglier than I though it would be:
- We want the same value that factors into ATOMIC_*_LOCK_FREE, which is MaxAtomicInlineWidth from tools/clang/lib/Basic/Targets.cpp (each Target should define a value there, or gets 0 by default).
- We can pass this to the backend through LLVM's TargetOptions, which clang's BackendUtil.cpp can initialize.
- AtomicExpandPass has a TargetMachine, which has TargetOptions.
This will work for code that come straight from C++, but won't work for code that only uses opt: clang has the information we need, and LLVM doesn't. This is pretty much something that should be in the DataLayout instead, which is a *very* involved change! WDYT? Am I missing something obvious?
Richard and I don't think you want that value. On x86-64, the maximum width here is actually 128 bits, but you want to split loads down to 64 bits.
- We can pass this to the backend through LLVM's TargetOptions, which clang's BackendUtil.cpp can initialize.
- AtomicExpandPass has a TargetMachine, which has TargetOptions.
This will work for code that come straight from C++, but won't work for code that only uses opt: clang has the information we need, and LLVM doesn't. This is pretty much something that should be in the DataLayout instead, which is a *very* involved change! WDYT? Am I missing something obvious?
You're probably right, but I don't know enough about the backends to be sure. It does seem like the backend should know the widths of atomics it supports.
It seems to me that this is a x86-64 only issue, in that only x86-64 wants to split >64bit atomic loads in two. Perhaps we could add a load attribute, ie tearable, that would signal to the backend that this particular load could be torn without violating frontend assumptions? The ability to tear this initial load seems to me to be a property of this particular transform, and not of general IR (obviously). WDYT?
Given this: https://blog.chromium.org/2017/05/goodbye-pnacl-hello-webassembly.html
I think this patch can be closed as "won't fix".
I think this should be an unconditional InitLoaded->setOrdering(AtomicCmpXchgInst::getStrongestFailureOrdering(MemOpOrder)); and there shouldn't be an IsRelaxed parameter. This will give the leading load a memory order that's valid for loads, and corresponds to the ordering the RMW instruction had.