This is an archive of the discontinued LLVM Phabricator instance.

Make `llvm::expandAtomicRMWToCmpXchg`'s initial load atomic.
ClosedPublic

Authored by DiamondLovesYou on Aug 25 2015, 2:44 PM.

Details

Reviewers
reames
jfb
Summary

PNaCl needs this so that optimization and target-machine passes don't reorder the initial load when translating to a native target. This also matches what the function's comment specifies.

Diff Detail

Event Timeline

DiamondLovesYou retitled this revision to Add a boolean parameter to make the initial load atomic.
DiamondLovesYou updated this object.
DiamondLovesYou added a reviewer: jfb.
DiamondLovesYou added a subscriber: llvm-commits.
jfb edited edge metadata. Aug 25 2015, 3:18 PM

This isn't just for PNaCl, so the description should be updated. You should mention something along the lines of:

On weak memory systems the CPU can speculate on subsequent loads (e.g. the cmpxchg) and observe them without honoring the happens-before ordering of the corresponding stores. This is the "load buffering" problem in the literature, and it occurs on ARM and POWER.

lib/CodeGen/AtomicExpandPass.cpp
556

I think this should be an unconditional InitLoaded->setOrdering(AtomicCmpXchgInst::getStrongestFailureOrdering(MemOpOrder)); and there shouldn't be an IsRelaxed parameter. This will give the leading load a memory order that's valid for loads, and corresponds to the ordering the RMW instruction had.
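
For context, a rough sketch of the compare-exchange loop that expandAtomicRMWToCmpXchg produces for an `atomicrmw add i32* %ptr, i32 %val seq_cst` (block and value names are illustrative, not the verbatim pass output); the suggestion above concerns the ordering placed on the leading load:

  entry:
    %init = load atomic i32, i32* %ptr seq_cst, align 4   ; the leading load discussed here
    br label %atomicrmw.start
  atomicrmw.start:
    %loaded = phi i32 [ %init, %entry ], [ %newloaded, %atomicrmw.start ]
    %new = add i32 %loaded, %val
    %pair = cmpxchg i32* %ptr, i32 %loaded, i32 %new seq_cst seq_cst
    %newloaded = extractvalue { i32, i1 } %pair, 0
    %success = extractvalue { i32, i1 } %pair, 1
    br i1 %success, label %atomicrmw.end, label %atomicrmw.start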

DiamondLovesYou edited edge metadata.

Unconditionally make the initial load atomic and fix the resulting X86 test failures.

DiamondLovesYou marked an inline comment as done. Aug 31 2015, 7:15 AM
jfb added inline comments. Aug 31 2015, 10:09 AM
test/Transforms/AtomicExpand/X86/expand-atomic-rmw-initial-load.ll
7

Could you also test the other memory orderings in separate test functions? I mainly want to ensure that an acq_rel RMW operation gets a load acquire (the others should have the same ordering).
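
For instance (a sketch only; the triple, operation, and FileCheck style of the existing test are assumed, and the function name is illustrative), an acq_rel case might look like:

  define i16 @test_initial_load_acq_rel(i16* %ptr, i16 %value) {
    %res = atomicrmw nand i16* %ptr, i16 %value acq_rel
    ret i16 %res
  }
  ; CHECK-LABEL: @test_initial_load_acq_rel
  ; CHECK: load atomic i16, i16* %ptr acquire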

DiamondLovesYou retitled this revision from Add a boolean parameter to make the initial load atomic. to Make `llvm::expandAtomicRMWToCmpXchg`'s initial load atomic..
DiamondLovesYou updated this object.

Fix comments.

DiamondLovesYou marked an inline comment as done. Aug 31 2015, 2:21 PM
DiamondLovesYou marked an inline comment as not done. Aug 31 2015, 2:22 PM
jfb added inline comments. Aug 31 2015, 2:28 PM
test/Transforms/AtomicExpand/X86/expand-atomic-rmw-initial-load.ll
7

Also relaxed, acquire, release (which should stay as-is).

jfb added a subscriber: morisset. Aug 31 2015, 2:28 PM

Adding Robin.

Added tests for the other atomic orderings.

DiamondLovesYou marked an inline comment as done. Aug 31 2015, 2:41 PM
DiamondLovesYou added inline comments.
test/Transforms/AtomicExpand/X86/expand-atomic-rmw-initial-load.ll
7

Sorry, I didn't realize you asked for the others until after I checked Done. Also, s/relaxed/monotonic/.

jfb accepted this revision. Aug 31 2015, 2:46 PM
jfb edited edge metadata.

lgtm, but leave open for a short while if @morisset wants to comment.

This revision is now accepted and ready to land. Aug 31 2015, 2:46 PM
In D12338#232597, @jfb wrote:

This isn't just for PNaCl, so the description should be updated. You should mention something along the lines of:

On weak memory systems the CPU can speculate on subsequent loads (e.g. the cmpxchg) and observe them without honoring the happens-before ordering of the corresponding stores. This is the "load buffering" problem in the literature, and it occurs on ARM and POWER.

I don't completely understand your argument here. The load buffering pattern (LB) occurs where there is a load from one location followed by a store to a different location (and I agree that in that case the store can be executed before the load on ARM/Power). Here there is a load followed by a cmpxchg to the same location. No cache-coherent processor I know of can do anything tricky in such a case (accesses to the same location are always globally ordered, and in a way compatible with the program order). The only thing I could see occurring is for the load to return stale data, which would only lead the cmpxchg to fail and cost an extra iteration of the loop. Am I missing something (likely, considering it is 1 am)?

So I agree that it deserves a comment to explain why it is safe, but I still believe the original code is safe; and I would rather avoid the extra cmpxchg8b/16b your change introduces unless they are necessary.

Extra remark: if the risk is indeed an LB pattern, then x86 is immune to it anyway, which makes me doubt the utility of these tests. Although in that case I don't have any better ones to suggest, since x86 is the only backend to exercise this function.

morisset added a subscriber: reames.

After thinking about it some more, I realized that having the load stay non-atomic is actually unsound, because there might be a race, which would make it undefined behaviour.
Sadly, making it monotonic would still introduce these redundant cmpxchg8b/16b, since monotonic requires the access to be atomic.
I don't know the semantics of unordered accesses well enough to know whether they would do the trick, and the LLVM reference manual's description of them is fairly light on details.

@reames: do you know if unordered has the right strength to allow races without forcing the backend to use cmpxchg8b/16b instead of two movs? If not, do you have a better suggestion?
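
For concreteness, the problematic case is an initial load wider than the target's native atomic width, e.g. (illustrative only):

  %init = load atomic i128, i128* %ptr monotonic, align 16
  ; On x86-64 an atomic 128-bit load is typically implemented with a lock cmpxchg16b
  ; loop, whereas a non-atomic load of the same value could be two plain 64-bit movs.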

jfb added a subscriber: jyasskin. Sep 2 2015, 9:19 AM

Thanks for chiming in. Maybe @jyasskin has an opinion from an LLVM memory model point of view.

After thinking about it some more, I realized that having the load stay non-atomic is actually unsound, because there might be a race, which would make it undefined behaviour.
Sadly, making it monotonic would still introduce these redundant cmpxchg8b/16b, since monotonic requires the access to be atomic.
I don't know the semantics of unordered accesses well enough to know whether they would do the trick, and the LLVM reference manual's description of them is fairly light on details.

@reames: do you know if unordered has the right strength to allow races without forcing the backend to use cmpxchg8b/16b instead of two movs? If not, do you have a better suggestion?

From http://llvm.org/docs/Atomics.html#unordered, "note that an unordered load or store cannot be split into multiple instructions."

You can't use a normal access because that allows a later pass to change it to:

%1 = load i32, i32* %ptr                     ; the initial (non-atomic) load
%2 = op i32 %1                               ; "op" stands for the RMW operation
%oops = load i32, i32* %ptr                  ; a later pass re-loads the location
cmpxchg i32* %ptr, i32 %oops, i32 %2 seq_cst seq_cst

Bleh. With the current instructions, I don't think there's a good answer here.

jfb added a comment. Sep 2 2015, 9:50 AM

Should we instead plop in a signal fence after the (non-atomic) load?
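
Presumably that would mean something like the following (illustrative; a singlethread fence is the IR form of a signal fence and emits no machine instructions, it only constrains compiler reordering):

  %init = load i32, i32* %ptr
  fence singlethread seq_cst   ; compiler-only "signal" fence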

jfb added a comment. Sep 3 2015, 2:54 PM

I talked to @jyasskin in person: a signal fence would work, but isn't quite what we want. We would want a new type of atomic that participates in ordering, but is allowed to tear. This doesn't exist, but we can emulate it by:

  1. Changing the pass to be parameterized with the "largest efficient atomic operation for this target". This is effectively the widest is_lock_free value.
  2. Passing this number when initializing the target's pass list.
  3. Having the pass explicitly tear too-wide loads into sub-loads of the type from 1., and reassemble the value from those.

This would manually create the tearing (tests would be the same as before) but the load would be atomic.
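
A sketch of what step 3 could look like for a 128-bit value on a target whose largest efficient atomic is 64 bits (little-endian layout assumed; names, types, and orderings are illustrative, not a worked-out design):

  %lo.addr = bitcast i128* %ptr to i64*
  %hi.addr = getelementptr i64, i64* %lo.addr, i64 1
  %lo = load atomic i64, i64* %lo.addr monotonic, align 16   ; low half (little-endian)
  %hi = load atomic i64, i64* %hi.addr monotonic, align 8    ; high half
  %lo.wide = zext i64 %lo to i128
  %hi.wide = zext i64 %hi to i128
  %hi.shift = shl i128 %hi.wide, 64
  %init = or i128 %hi.shift, %lo.wide   ; reassembled value that feeds the cmpxchg loop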

@morisset, does that sound good?

It sounds great to me (a lot of work for such a corner case, but I don't see any better solution).

jfb added a comment. Sep 14 2015, 11:07 AM

@jyasskin I took a look at implementing what we discussed, and it's quite a bit uglier than I thought it would be:

  • We want the same value that factors into ATOMIC_*_LOCK_FREE, which is MaxAtomicInlineWidth from tools/clang/lib/Basic/Targets.cpp (each Target should define a value there, or it gets 0 by default).
  • We can pass this to the backend through LLVM's TargetOptions, which clang's BackendUtil.cpp can initialize.
  • AtomicExpandPass has a TargetMachine, which has TargetOptions.

This will work for code that comes straight from C++, but won't work for code that only uses opt: clang has the information we need, and LLVM doesn't. This is pretty much something that should be in the DataLayout instead, which is a *very* involved change! WDYT? Am I missing something obvious?

In D12338#245398, @jfb wrote:

@jyasskin I took a look at implementing what we discussed, and it's quite a bit uglier than I thought it would be:

  • We want the same value that factors into ATOMIC_*_LOCK_FREE, which is MaxAtomicInlineWidth from tools/clang/lib/Basic/Targets.cpp (each Target should define a value there, or it gets 0 by default).

Richard and I don't think you want that value. On x86-64, the maximum width here is actually 128 bits, but you want to split loads down to 64 bits.

  • We can pass this to the backend through LLVM's TargetOptions, which clang's BackendUtil.cpp can initialize.
  • AtomicExpandPass has a TargetMachine, which has TargetOptions.

This will work for code that comes straight from C++, but won't work for code that only uses opt: clang has the information we need, and LLVM doesn't. This is pretty much something that should be in the DataLayout instead, which is a *very* involved change! WDYT? Am I missing something obvious?

You're probably right, but I don't know enough about the backends to be sure. It does seem like the backend should know the widths of atomics it supports.

DiamondLovesYou marked an inline comment as done. Nov 23 2015, 9:01 AM
In D12338#245398, @jfb wrote:

@jyasskin I took a look at implementing what we discussed, and it's quite a bit uglier than I thought it would be:

  • We want the same value that factors into ATOMIC_*_LOCK_FREE, which is MaxAtomicInlineWidth from tools/clang/lib/Basic/Targets.cpp (each Target should define a value there, or it gets 0 by default).

Richard and I don't think you want that value. On x86-64, the maximum width here is actually 128 bits, but you want to split loads down to 64 bits.

  • We can pass this to the backend through LLVM's TargetOptions, which clang's BackendUtil.cpp can initialize.
  • AtomicExpandPass has a TargetMachine, which has TargetOptions.

This will work for code that comes straight from C++, but won't work for code that only uses opt: clang has the information we need, and LLVM doesn't. This is pretty much something that should be in the DataLayout instead, which is a *very* involved change! WDYT? Am I missing something obvious?

You're probably right, but I don't know enough about the backends to be sure. It does seem like the backend should know the widths of atomics it supports.

It seems to me that this is an x86-64-only issue, in that only x86-64 wants to split >64-bit atomic loads in two. Perhaps we could add a load attribute, i.e. tearable, that would signal to the backend that this particular load could be torn without violating frontend assumptions? The ability to tear this initial load seems to me to be a property of this particular transform, and not of general IR (obviously). WDYT?
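
Purely to illustrate that proposal (this attribute does not exist; the syntax is hypothetical):

  %init = load atomic i128, i128* %ptr monotonic tearable, align 16   ; hypothetical 'tearable' marker
  ; the backend would then be free to split this into two 64-bit loads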

jfb closed this revision. May 7 2018, 10:01 PM

Given this: https://blog.chromium.org/2017/05/goodbye-pnacl-hello-webassembly.html
I think this patch can be closed as "won't fix".