This is an archive of the discontinued LLVM Phabricator instance.

Switch over targets to use AtomicExpandPass, and clean up target atomics code.
Needs Review · Public

Authored by jyknight on Mar 15 2016, 3:39 PM.

Details

Reviewers
javed.absar
jfb
Summary
  • Removed __sync_* libcall names from the default set of libcalls, so that they cannot be accidentally invoked unless requested. Targets call initSyncLibcalls() to request them where they are supported and required (currently ARM and MIPS16); see the sketch after this list.
  • Deleted 'enableAtomicExpand' TargetLowering function: it's always enabled if you add the pass.
  • Removed unnecessary selection dag expansions of ATOMIC_LOAD into ATOMIC_CMP_SWAP and ATOMIC_STORE into ATOMIC_SWAP. Targets that don't support atomic load/store should now be handled by the translation in AtomicExpandPass into atomic_load/atomic_store, rather than expanding into unnatural operations.
  • Cleaned up conditionals in target ISel code that handled too-large atomic operations; such operations are now expanded before they reach the target's code.
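
For illustration, here is a rough sketch of how a target would opt in under this scheme. The names FooTargetLowering, FooSubtarget, hasNativeAtomics(), and needsKernelAssistedAtomics() are hypothetical; initSyncLibcalls() is the hook added by this patch, so its exact shape may differ:

  // Sketch only: a hypothetical target opting into the new scheme.
  FooTargetLowering::FooTargetLowering(const TargetMachine &TM,
                                       const FooSubtarget &STI)
      : TargetLowering(TM), Subtarget(STI) {
    // Widths the backend can lower natively; anything wider is turned
    // into __atomic_* libcalls by AtomicExpandPass before ISel.
    setMaxAtomicSizeInBitsSupported(Subtarget.hasNativeAtomics() ? 64 : 0);

    // Only targets that rely on OS/kernel-assisted atomic sequences ask
    // for the __sync_* libcall names back; they are no longer in the
    // default set.
    if (Subtarget.needsKernelAssistedAtomics())
      initSyncLibcalls();
  }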

ARM is the most complex target, as it's the only one which mixes all
the different options for atomic lowering, depending on the exact
subarchitecture and OS in use:

  1. AtomicExpandPass expansion to LL/SC instructions, where available natively (ARMv6+ in ARM mode, ARMv7+ in Thumb mode).
  2. Passthrough of atomicrmw and cmpxchg in AtomicExpandPass, followed by ISel expansion to __sync_* library calls, when native instructions are not available but the OS provides a kernel-supported pseudo-atomic sequence. (This is Linux and Darwin, at the moment.)
  3. AtomicExpandPass expansion to __atomic_* library calls, if neither of the above is possible for a given size on the given CPU architecture. (See the sketch below.)
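
Roughly, the choice among these modes is made per operation by the atomic-expansion hooks. A simplified sketch of the selection logic, not the exact ARM code (hasLLSCInstructions() stands in for the real subtarget checks):

  TargetLowering::AtomicExpansionKind
  FooTargetLowering::shouldExpandAtomicRMWInIR(AtomicRMWInst *AI) const {
    unsigned Size = AI->getType()->getPrimitiveSizeInBits();
    // Mode 3: sizes beyond the supported maximum never reach this hook;
    // AtomicExpandPass has already turned them into __atomic_* calls.
    // Mode 1: native LL/SC available -- expand to an ll/sc loop in IR.
    if (Subtarget.hasLLSCInstructions() && Size <= 64)
      return AtomicExpansionKind::LLSC;
    // Mode 2: no native instructions, but the OS provides a kernel-assisted
    // sequence -- leave the atomicrmw intact so ISel emits a __sync_* call.
    return AtomicExpansionKind::None;
  }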

So, naturally, ARM required the most changes here.

Also of note are the XCore changes. I removed all the code handling
ATOMIC_STORE/ATOMIC_LOAD, because as far as I can tell, XCore does not
have any cmpxchg support, and thus cannot claim to implement atomics
of any size.

Diff Detail

Event Timeline

jyknight updated this revision to Diff 50781. Mar 15 2016, 3:39 PM
jyknight retitled this revision from to Switch over targets to use AtomicExpandPass, and clean up target atomics code..
jyknight updated this object.
jyknight added subscribers: theraven, rnk, hfinkel and 4 others.

Targets call initSyncLibcalls() to request them where they're supported and required. (This is ARM and MIPS16 at the moment)

What's the benefit here, performance? I can't seem to find where they'd ever be used unset (only getATOMIC/getSYNC references them, and all callees I can find assert that they get a valid libcall).

Removed unnecessary selection dag expansions of ATOMIC_LOAD into ATOMIC_CMP_SWAP and ATOMIC_STORE into ATOMIC_SWAP. Targets that don't support atomic load/store should now be handled by the translation in AtomicExpandPass into atomic_load/atomic_store, rather than expanding into unnatural operations.

What if the target doesn't have those libcalls? Darwin doesn't, for example, and we're going to have to allow deploying back to older OSs for quite a while even if we added them immediately (and, as you say, they have to be in a .dylib so we can't ship them with the compiler).

lib/CodeGen/SelectionDAG/SelectionDAG.cpp
1438–1440

Isn't this an assertion-level error? It doesn't seem to be actionable by the user in any reasonable way.

Targets call initSyncLibcalls() to request them where they're supported and required. (This is ARM and MIPS16 at the moment)

What's the benefit here, performance? I can't seem to find where they'd ever be used unset (only getATOMIC/getSYNC references them, and all callees I can find assert that they get a valid libcall).

I added this because I found that they were getting emitted in error on targets that should not be using them (in intermediate versions of this patch). So, I added that as essentially a safety check, to make it impossible for those expansions to be used where they should not be, which is almost everywhere.

Removed unnecessary selection dag expansions of ATOMIC_LOAD into ATOMIC_CMP_SWAP and ATOMIC_STORE into ATOMIC_SWAP. Targets that don't support atomic load/store should now be handled by the translation in AtomicExpandPass into atomic_load/atomic_store, rather than expanding into unnatural operations.

What if the target doesn't have those libcalls? Darwin doesn't, for example, and we're going to have to allow deploying back to older OSs for quite a while even if we added them immediately (and, as you say, they have to be in a .dylib so we can't ship them with the compiler).

To the more general question "what happens if a user uses an atomic function which requires an __atomic libcall on a platform that doesn't provide it": they'll get a linker error about it not being defined.

But that doesn't have anything to do with the ATOMIC_LOAD/ATOMIC_STORE dag expansions. The __atomic_* libcalls are only generated by AtomicExpandPass for atomics that are too large, or misaligned.

So the DAG expansions would only come into play if you somehow had a platform which claims (via setMaxAtomicSizeInBitsSupported) backend support for that size, but doesn't even have an atomic load or store instruction capable of that size. I know of no platform where that could possibly be useful. Certainly none that Darwin supports.

That's not to say that these expansions never USED to get used -- they did, prior to this change. For two reasons:

First, and most obviously: AtomicExpandPass didn't take care of the "no atomics at all of this size" case before getting to ISel. That issue is now gone.

Second (and probably where your concern is coming from): on ARM, LLVM used to go through that expansion for some CPU models. That was never actually necessary, though, and this patch changes it to use a normal load/store instruction, with an appropriate barrier, instead. (And if a barrier instruction isn't available, it'll use a barrier libcall.)

E.g. see the changed behavior in test/CodeGen/ARM/atomic-load-store.ll:

 %val = load atomic i32, i32* %ptr seq_cst, align 4
-; THUMBONE: __sync_val_compare_and_swap_4
+; THUMBONE: ldr
+; THUMBONE: __sync_synchronize

And test/CodeGen/ARM/atomic-op.ll:

 store atomic i32 %val1, i32* %mem1 release, align 4
 store atomic i32 %val2, i32* %mem2 release, align 4
-; CHECK-M0: ___sync_lock_test_and_set
-; CHECK-M0: ___sync_lock_test_and_set
+; CHECK-M0: dmb
+; CHECK-M0: str r1, [r0]
+; CHECK-M0: dmb
+; CHECK-M0: str r3, [r2]

So, in summary, I don't think Darwin has anything to worry about here.

t.p.northover added a comment. Edited Mar 16 2016, 7:38 PM

So the DAG expansions would only come into play if you somehow had a platform which claims (via setMaxAtomicSizeInBitsSupported) backend support for that size, but doesn't even have an atomic load or store instruction capable of that size. I know of no platform where that could possibly be useful. Certainly none that Darwin supports.

Ugh, bloody Phabricator! This is what I meant to say:

I believe x86 falls into this category for 128-bit types. 8.1.1 of
Volume 3[1] gives some guarantees for normal operations up to 64-bits,
but says nothing about 128.

To get 128-bit guarantees you seem to need a LOCK prefix, which can't
be legally applied to a MOV (but can to cmpxchg16b and xchg). Even the
cmpxchg16b documentation says that "The processor never produces a
locked read without also producing a locked write" which, while not
directly applicable to MOV, is suggestive.

Finally, this is how GCC's libatomic actually implements these
operations, except less buggily (it seems Clang doesn't actually take
account of that cmpxchg16b note, so atomic loads are probably broken
now).

The broken state of Clang might be a get-out, but not trivially:

  • We can't expand to __atomic_load_16 because we can't rely on the correct .dylib being present for a long time.
  • We can't expand to __sync_load because it doesn't exist.

So the alternatives seem to be a new sync_load & sync_store (when
everything else uses cmpxchg16b) or to keep (and fix[2]) the existing
expansions. I think I'd prefer the latter.

Cheers.

Tim.

[1] https://www-ssl.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf
[2] I'm happy to do that after you're done; please don't think I'm
trying to foist fixing more of (probably) my bugs on you.

On the plus side, GCC itself seems to be as buggy as clang (i.e. it loads with a cmpxchg16b with no loop to ensure success) so maybe real processors are more forgiving than the documentation suggests.

Tim.

Sorry, yes. The handling of extra-wide loads/stores slipped my mind when writing the comment, although I was aware of it earlier. I'd note also that ARM actually has a similar thing: you need to use ll/sc for wide loads and stores, not normal loads and stores (excepting, of course, 64-bit loads/stores on CPUs that support LPAE, per the unresolved FIXME note there).

See shouldExpandAtomicStoreInIR and shouldExpandAtomicLoadInIR for how this is handled in llvm these days (both before and after this patch).
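
For context, those hooks look roughly like this on ARM for the wide (64-bit) case; this is an approximate sketch, not the verbatim source:

  // 64-bit atomic loads/stores need ldrexd/strexd, so they are rewritten
  // in IR rather than reaching ISel as plain loads/stores.
  bool ARMTargetLowering::shouldExpandAtomicStoreInIR(StoreInst *SI) const {
    // Expanded to an "atomicrmw xchg", which then takes the LL/SC path.
    return SI->getValueOperand()->getType()->getPrimitiveSizeInBits() == 64;
  }

  TargetLowering::AtomicExpansionKind
  ARMTargetLowering::shouldExpandAtomicLoadInIR(LoadInst *LI) const {
    // A 64-bit atomic load only needs the "load" half of an LL/SC pair.
    return LI->getType()->getPrimitiveSizeInBits() == 64
               ? AtomicExpansionKind::LLOnly
               : AtomicExpansionKind::None;
  }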

What I meant to say is that I believe there's no situation where you should ever see an expansion of an atomic store to a `__sync_lock_test_and_set_*` call or an atomic load to a `__sync_val_compare_and_swap_*` call, which is really all that this code was being used for.

After this change cmpxchg16b does (still) get emitted for 16-byte atomic load/store operations on x86-64, and no libcall is needed...when your CPU supports that instruction.

But here is one thing that might affect Darwin, come to think of it. In clang, in http://reviews.llvm.org/D17933, and in LLVM here, we now correctly check whether the architecture *actually has* the cx16 feature when setting the maximum atomic size. Previously, some code assumed that all x86-64 CPUs had it while other code did not, somewhat inconsistently. Thus, if you compile for an x86 CPU without cmpxchg16b support, you will (now) get `__atomic_*` libcalls. That's absolutely the correct behavior.
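
Concretely, the size gate amounts to roughly the following; this is an illustrative sketch, not the exact X86 lowering code:

  // 64-bit case only: the maximum natively supported atomic width now
  // follows the cx16 feature, so 16-byte operations stay inline only
  // when cmpxchg16b actually exists; otherwise AtomicExpandPass emits
  // __atomic_* libcalls for them.
  if (Subtarget.is64Bit())
    setMaxAtomicSizeInBitsSupported(Subtarget.hasCmpxchg16b() ? 128 : 64);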

Whereas before, clang would pretend you always had lock-free 16-byte atomics and emit an LLVM 'store atomic' instruction. LLVM's x86 backend would not have expanded it to cmpxchg in shouldExpandAtomicStoreInIR, so you'd be left with a 16-byte ATOMIC_STORE DAG node, which would then be lowered to ATOMIC_SWAP and then `__sync_lock_test_and_set_16` via the removed code we're talking about. But if you're actually on a CPU without cmpxchg16b, there's no way to implement that function according to spec, since you're not supposed to use a mutex.

I don't think Darwin actually supports any pre-cx16 CPUs though, so this really shouldn't affect it. But perhaps somewhere a default CPU architecture needs to be set, if it's not already?

On the plus side, GCC itself seems to be as buggy as clang (i.e. it loads with a cmpxchg16b with no loop to ensure success) so maybe real processors are more forgiving than the documentation suggests.

I don't think there's any bug there. According to the manual (the rest of the paragraph you quoted):

To simplify the interface to the processor’s bus, the destination operand receives a write cycle without regard to the result of the comparison. The destination operand is written back if the comparison fails; otherwise, the source operand is written into the destination. (The processor never produces a locked read without also producing a locked write.)

That is, the lock cmpxchg16b instruction doesn't fail when the comparison comes out false. It always does both a load and a store. It's just that if the comparison was false, it stores back the just-loaded value instead of the new value.
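
In other words, a single loop-free cmpxchg can serve as a 16-byte atomic load. A minimal source-level sketch, assuming a GCC/Clang-style __sync builtin and a target compiled with -mcx16 so that cmpxchg16b is available:

  // Lock cmpxchg16b always performs a locked read and a locked write;
  // when the comparison fails it writes back the value it just read, so
  // memory is left unchanged either way and the return value is an
  // atomic 16-byte snapshot.
  __int128 atomic_load_16(volatile __int128 *p) {
    return __sync_val_compare_and_swap(p, (__int128)0, (__int128)0);
  }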

I don't think Darwin actually supports any pre-cx16 CPUs though, so this really shouldn't affect it. But perhaps somewhere a default CPU architecture needs to be set, if it's not already?

No, that should be fine. Thanks for the reassurances (and reminding me that AtomicExpand did it). I was initially confused by the "unnatural operations".

That is, the lock cmpxchg16b instruction doesn't fail when the comparison comes out false. It always does both a load and a store. It's just that if the comparison was false, it stores back the just-loaded value instead of the new value.

Ah yes, see what you mean there. I agree.

Thanks again for the explanations.

Tim.

jfb added inline comments. Mar 23 2016, 6:21 PM
include/llvm/Target/TargetLowering.h
168

Would be nice to document what it does otherwise.

lib/CodeGen/SelectionDAG/SelectionDAG.cpp
1438–1440

Yes, should probably be a separate change anyways?

lib/Target/AArch64/AArch64ISelLowering.cpp
10219

Is the assumption here that Size > 128 never gets to this code? Could that be asserted?

10232

"RMW"

lib/Target/ARM/ARMISelLowering.cpp
847–848

What happens now when ThreadModel is Single? Is that handled elsewhere?

12120

The comment confuses me: Thumb1 and pre-v6 never get to the else clause, right? The fence is for everyone else?

12180

Seems better to just change the name of the hasLdrex function instead of explaining it here :-)

12202–12204

Same here, size can't be > 64? Assert?

lib/Target/SystemZ/SystemZISelLowering.cpp
131

This comment seems kind of obvious from the code.

lib/Target/WebAssembly/WebAssemblyISelLowering.cpp
56

This is correct for wasm (whereas I'm not certain for some of the other ISAs).

test/CodeGen/ARM/atomic-cmpxchg.ll
93

Missing CHECKs.

test/CodeGen/ARM/atomic-load-store.ll
91

Ugh we should eliminate redundant libcalls elsewhere...

test/CodeGen/ARM/atomic-op.ll
240

Colon after func4:

jyknight updated this revision to Diff 52422. Apr 1 2016, 2:03 PM
jyknight marked 5 inline comments as done.

Updates for review.

jyknight added inline comments. Apr 1 2016, 2:04 PM
lib/CodeGen/SelectionDAG/SelectionDAG.cpp
1438–1440

Yeah, you're right. Actually, I'm going to just remove this change entirely. You get an assertion failure:

llc: llvm/include/llvm/ADT/StringRef.h:73: llvm::StringRef::StringRef(const char *): Assertion `Str && "StringRef cannot be built from a NULL argument"' failed.

anyway, without it, with a stack trace pointing you at the problem.

lib/Target/AArch64/AArch64ISelLowering.cpp
10219

Yes; sizes above 128 bits get expanded to libcalls by AtomicExpandPass before getting here.

lib/Target/ARM/ARMISelLowering.cpp
847–848

Yeah, this is handled in ARMTargetMachine.cpp, in addIRPasses: if ThreadModel == Single, it adds the LowerAtomic pass, which changes all atomic ops to non-atomic ops and deletes all fences. That's actually purely dead code at this point; committed separately.

12120

Thumb1 and pre-v6 do get in here. Then an IR fence instruction is created, so that it will get expanded into the libcall. CPUs with actual fence instructions use special intrinsics instead, in order to emit the particular kind of fence instruction needed here.
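
A simplified sketch of that dispatch; the function name makeBarrier is a stand-in, not the real ARM entry point:

  // Subtargets without a real barrier instruction get an IR fence, which
  // is later expanded to the __sync_synchronize libcall; everyone else
  // gets the specific DMB variant needed here.
  Instruction *ARMTargetLowering::makeBarrier(IRBuilder<> &Builder,
                                              AtomicOrdering Ord) const {
    if (!Subtarget->hasDataBarrier())       // Thumb1 / pre-v6: no dmb.
      return Builder.CreateFence(Ord);      // Becomes __sync_synchronize.
    Module *M = Builder.GetInsertBlock()->getModule();
    return Builder.CreateCall(
        Intrinsic::getDeclaration(M, Intrinsic::arm_dmb),
        {Builder.getInt32(ARM_MB::ISH)});   // dmb ish
  }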

12180

Well, hasLdrex is actually a good name. I'm trying to explain why I'm *using* it here. But apparently failed -- I've reworded, hopefully more clearly.

test/CodeGen/ARM/atomic-cmpxchg.ll
93

Oops!

test/CodeGen/ARM/atomic-load-store.ll
91

Yes, it might be nice to have a redundant fence elimination pass. You could also sometimes remove a fence which is redundant with a neighboring atomic instruction.

E.g.: "atomic_store seq_cst; atomic_fence seq_cst" could potentially eliminate the fence on some (all existing?) targets.

But that's a whole other project.

Is this still a work in progress?

Will this land sometime?

wmi added a subscriber: wmi. Sep 14 2017, 5:14 PM

Any plan to land the patch soon? After https://reviews.llvm.org/rL312830, with better alignment information for atomic objects, more atomic loads/stores are generated for 128-bit atomic objects instead of atomic libcalls. Those 128-bit atomic loads/stores are translated into __sync_* libcalls on x86-64 targets without cmpxchg16b support. This patch is needed so that AtomicExpandPass generates atomic libcalls before ISel would generate __sync_* libcalls.

asb added a subscriber: asb. May 30 2018, 8:48 PM

Are there any plans to push this forwards?

In current LLVM, MaxAtomicSizeInBitsSupported defaults to 1024, contrary to the documentation. Since its addition in April 2016, it has carried a comment indicating this will be fixed in the next commit:

// TODO: the default will be switched to 0 in the next commit, along
// with the Target-specific changes necessary.
MaxAtomicSizeInBitsSupported = 1024;

Adjusting that comment and documenting the 1024 default in docs/Atomics.rst is an easy fix of course. At this point, is that the best next step? Or would it be worth adjusting the default MaxAtomicSizeInBitsSupported even without the rest of this patch? Affected in-tree targets can obviously just get a new setMaxAtomicSizeInBitsSupported(1024) call to preserve the previous behaviour, but this may trip up out-of-tree backends for little gain vs just fixing the docs.

Revisiting this topic after a long time: I see that this now seems to be handled in Clang since D38046?