This is an archive of the discontinued LLVM Phabricator instance.

[ARM] Fix ARM backend to correctly use atomic expansion routines.
ClosedPublic

Authored by efriedma on Feb 17 2022, 2:04 AM.

Details

Summary

Without this patch, clang would generate calls to __sync_* routines on targets where it does not make sense; we can't assume the routines exist on unknown targets. Linux has special implementations of the routines that work on old ARM targets; other targets have no such routines. In general, atomics operations which aren't natively supported should go through libatomic (__atomic_*) APIs, which can support arbitrary atomics through locks.
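
For illustration (a sketch, not taken from the patch): on a pre-v6 target such as armv5te, an ordinary atomic read-modify-write like the one below previously lowered to a __sync_* call that may not exist outside Linux; with this change it is routed through the __atomic_* libatomic interface instead, which is allowed to fall back to locks.

```cpp
#include <atomic>

// armv5te has no hardware CAS, so this cannot be inlined.
// Before the patch: a call to __sync_fetch_and_add_4 (which may not exist
// outside Linux).  After the patch: a libatomic call, e.g.
// __atomic_fetch_add_4, which may take a lock internally.
int bump(std::atomic<int> &counter) {
    return counter.fetch_add(1, std::memory_order_seq_cst);
}
```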

ARM targets older than v6, where this patch makes a difference, are rare in practice, but not completely extinct. See, for example, discussion on D116088.

This also affects Cortex-M0, but I don't think __sync_* routines actually exist in any Cortex-M0 libraries. So in practice this just leads to a slightly different linker error for those cases, I think.

Mechanically, this patch does the following:

  1. Ensures we run atomic expansion unconditionally; it never makes sense to completely skip it.
  2. Fixes getMaxAtomicSizeInBitsSupported() so it returns an appropriate number on all ARM subtargets.
  3. Fixes shouldExpandAtomicRMWInIR() and shouldExpandAtomicCmpXchgInIR() to correctly handle subtargets that don't have atomic instructions.

Diff Detail

Event Timeline

efriedma created this revision. Feb 17 2022, 2:04 AM
efriedma requested review of this revision. Feb 17 2022, 2:04 AM
Herald added a project: Restricted Project. Feb 17 2022, 2:04 AM
efriedma edited the summary of this revision. Feb 17 2022, 2:05 AM
efriedma added a reviewer: jyknight.
lkail added a subscriber: lkail. Feb 17 2022, 2:26 AM
joerg added a comment. Feb 17 2022, 4:29 AM

I'm ambivalent about whether to use __sync_* or __atomic_*, but the last time I looked at this, we generated the generic unsized variant a lot, even when the arguments have a known static size.
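
For reference, the distinction joerg describes can be illustrated with a small example (a sketch, not taken from the review): when the object has a known 4-byte size, the lowering can use the sized libcall rather than the generic one that takes an explicit size argument.

```cpp
#include <atomic>

// With a known 4-byte object, lowering can use the sized libcall
// __atomic_compare_exchange_4 rather than the generic, size-taking
// __atomic_compare_exchange; joerg's observation is that the generic form
// was often emitted even in cases like this.
bool cas_int(std::atomic<int> &p, int expected, int desired) {
    return p.compare_exchange_strong(expected, desired);
}
```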

Herald added a project: Restricted Project. Mar 3 2022, 10:31 AM
john.brawn accepted this revision. Mar 16 2022, 10:26 AM

LGTM, but please adjust the comment in ARMISelLowering.cpp as suggested by the clang-format linter before committing.

This revision is now accepted and ready to land. Mar 16 2022, 10:26 AM
This revision was landed with ongoing or failed builds. Mar 18 2022, 12:44 PM
This revision was automatically updated to reflect the committed changes.
aykevl added a subscriber: aykevl. Mar 22 2022, 7:46 AM

> This also affects Cortex-M0, but I don't think __sync_* routines actually exist in any Cortex-M0 libraries. So in practice this just leads to a slightly different linker error for those cases, I think.

TinyGo implements these __sync_* routines so that atomic operations work. See https://github.com/tinygo-org/tinygo/blob/v0.22.0/src/runtime/atomics_critical.go#L188. I believe the __sync_* variants only exist for Cortex-M / ARM; other targets use __atomic_*. That said, I'm generally in favor of this change because it looks more correct and aligns with atomics on other LLVM backends.

I suspect Rust Embedded also implements these routines (I think I've seen them somewhere) but I can't find it easily.

Related: https://reviews.llvm.org/D61052
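
For readers unfamiliar with these shims, a rough sketch of the approach follows (illustrative only, not TinyGo's actual code): on a single-core Cortex-M0 the only way to make the operation atomic is to mask interrupts around it.

```cpp
#include <cstdint>

// Critical-section-based __sync_* shim for a single-core Cortex-M0
// (sketch).  Masking interrupts is the only way to get atomicity on this
// core, which is exactly why compiler-rt doesn't want to ship such an
// implementation behind the user's back.
static inline uint32_t save_and_disable_irq() {
    uint32_t primask;
    asm volatile("mrs %0, PRIMASK" : "=r"(primask));
    asm volatile("cpsid i" ::: "memory");
    return primask;
}

static inline void restore_irq(uint32_t primask) {
    // Restore the previous mask rather than blindly re-enabling interrupts.
    asm volatile("msr PRIMASK, %0" :: "r"(primask) : "memory");
}

// Signature follows the usual __sync_* libcall convention.
extern "C" uint32_t __sync_val_compare_and_swap_4(uint32_t *ptr,
                                                  uint32_t oldval,
                                                  uint32_t newval) {
    uint32_t primask = save_and_disable_irq();
    uint32_t observed = *ptr;
    if (observed == oldval)
        *ptr = newval;
    restore_irq(primask);
    return observed;
}
```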

llvm/test/CodeGen/ARM/atomic-op.ll
414

This looks like a regression. I'm pretty sure that these loads and stores are always atomic. Is there a reason why this has to go through a library function?

efriedma added inline comments. Mar 22 2022, 11:08 AM
llvm/test/CodeGen/ARM/atomic-op.ll
414

In general, __atomic_* routines are not guaranteed to be lock-free. Suppose __atomic_compare_exchange_4 is running at the same time on a different thread; if we store directly to the address, we're ignoring whatever lock __atomic_compare_exchange_4 uses internally.

Because of that, unless we have some prior knowledge that libatomic supports a lock-free cmpxchg of the given width, all atomic operations have to go through libatomic. It doesn't matter whether load or store ops are atomic at the hardware level.

This isn't a problem with __sync_* because they're guaranteed to be lock-free.

See also https://llvm.org/docs/Atomics.html#libcalls-atomic.
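
To make the hazard concrete, here is a sketch (not the real compiler-rt code) of a lock-based CAS alongside a plain store that bypasses the lock:

```cpp
#include <atomic>
#include <mutex>

// Hypothetical lock-based libatomic CAS: the lock only protects callers
// that go through the library.
static std::mutex g_atomic_lock;   // stand-in for libatomic's internal lock

bool locked_cas_4(int *ptr, int *expected, int desired) {
    std::lock_guard<std::mutex> guard(g_atomic_lock);
    if (*ptr == *expected) { *ptr = desired; return true; }
    *expected = *ptr;
    return false;
}

// A direct store like this ignores g_atomic_lock entirely, so it can land
// in the middle of locked_cas_4's read-compare-write sequence on another
// thread.  That is why atomic loads and stores must also become libcalls
// once the CAS may be lock-based.
void plain_store(int *ptr, int value) { *ptr = value; }
```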

Interesting: this commit results in testing timeouts in our downstream compiler for several C++ compliance test suites (ACE, plumhall, perennial), but only for Cortex-M0/M0plus. I confess I don't understand the nature of this commit or how it would impact runtime here, other than that you indicate it could impact Cortex-M0. Do you have any pointers as to what I could verify here?

The impact on Cortex-M0 is that instead of using __sync_* calls and inlined atomic load/store ops, it uses __atomic_* calls. I can't imagine that causing timeouts unless you're somehow using a bad implementation of __atomic_*.

> The impact on Cortex-M0 is that instead of using __sync_* calls and inlined atomic load/store ops, it uses __atomic_* calls. I can't imagine that causing timeouts unless you're somehow using a bad implementation of __atomic_*.

Hmm. It appears to be getting stuck in the lock and is spinning forever. Something isn't being built correctly in the builtins, maybe?

By "the builtins", do you mean you're using compiler-rt with -DCOMPILER_RT_EXCLUDE_ATOMIC_BUILTIN=Off? That implementation probably doesn't work on a Cortex-M0 unless you've hacked it somehow; it assumes width-1 lock-free atomics are available.

> By "the builtins", do you mean you're using compiler-rt with -DCOMPILER_RT_EXCLUDE_ATOMIC_BUILTIN=Off? That implementation probably doesn't work on a Cortex-M0 unless you've hacked it somehow; it assumes width-1 lock-free atomics are available.

Yes, that's exactly what we're doing. What should we be doing instead for Cortex-M0? I don't know a lot about the history of this implementation, but if it doesn't work for M0, why is your change considered acceptable for M0?

> By "the builtins", do you mean you're using compiler-rt with -DCOMPILER_RT_EXCLUDE_ATOMIC_BUILTIN=Off? That implementation probably doesn't work on a Cortex-M0 unless you've hacked it somehow; it assumes width-1 lock-free atomics are available.

> Yes, that's exactly what we're doing. What should we be doing instead for Cortex-M0? I don't know a lot about the history of this implementation, but if it doesn't work for M0, why is your change considered acceptable for M0?

There is no way to implement an "atomic" operation on an M0 besides turning off interrupts, and we don't want the builtins library to mess with the interrupt mask. I'd expect the OS/user to provide the functionality, if they need it. compiler-rt doesn't provide __sync_* for cortex-m0, for the same reason.

I'm not sure how your setup was working before. I guess maybe if you only need atomics for C++ static local vars, and you built libc++abi with LIBCXXABI_HAS_NO_THREADS? If you don't actually need threading/atomics, you can use -fno-threadsafe-statics and/or -mthread-model single.
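
As a small illustration of the static-local case mentioned above (a sketch, not from the review): thread-safe initialization of a function-local static goes through a guard variable (__cxa_guard_acquire/__cxa_guard_release in the C++ ABI library), and the fast-path guard check is typically an acquire load of the guard byte. -fno-threadsafe-statics, or an ABI library built without threads, drops the atomic/locking part of that machinery.

```cpp
int compute() { return 42; }   // stands in for a non-constant initializer

// With threadsafe statics enabled, the first call performs an atomic guard
// check and, on the slow path, calls __cxa_guard_acquire/release.  With
// -fno-threadsafe-statics the atomic/locking part of the guard is omitted,
// which is fine for single-threaded code.
int &counter() {
    static int value = compute();
    return value;
}
```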

> There is no way to implement an "atomic" operation on an M0 besides turning off interrupts, and we don't want the builtins library to mess with the interrupt mask. I'd expect the OS/user to provide the functionality, if they need it. compiler-rt doesn't provide __sync_* for cortex-m0, for the same reason.

> I'm not sure how your setup was working before. I guess maybe if you only need atomics for C++ static local vars, and you built libc++abi with LIBCXXABI_HAS_NO_THREADS? If you don't actually need threading/atomics, you can use -fno-threadsafe-statics and/or -mthread-model single.

OK thank you for answering my questions. Our downstream compiler targets primarily baremetal systems on embedded devices, so we can't rely on an OS provided implementation of atomics. I suspect we're not the only downstream compiler impacted. I will try the options you suggest and let you know, but we may have to do something else to work around the effects of this commit. We weren't doing anything special other than what was out of the box, so this change took us by surprise.

It probably makes sense to add #error to atomics.c, I guess, so that particular mistake gets caught at compile-time, not runtime.

Beyond that, though, I'm not sure what else I can do here. I guess I could set setMaxAtomicSizeInBitsSupported(32) for Cortex-M0, even though the target doesn't really have such atomics, and document "users must provide __sync_*_{1,2,4}". That seems a bit weird, though; like I mentioned before, we have no way to actually provide them. (I guess we could provide them off-by-default, or something, but again, I don't want to mess with the interrupt mask behind the user's back.)

nikic added a subscriber: nikic. Jul 23 2022, 11:42 AM
nikic added inline comments.
llvm/test/CodeGen/ARM/atomic-op.ll
413

Is the dmb still needed if we're going through __atomic?

efriedma added inline comments. Jul 23 2022, 3:10 PM
llvm/test/CodeGen/ARM/atomic-op.ll
413

__atomic_load_4 is just supposed to be a sequentially consistent load, and a sequentially consistent load doesn't imply a full fence. So there isn't any obvious rule that would allow that transform.

See also https://github.com/llvm/llvm-project/issues/29472 and https://github.com/llvm/llvm-project/issues/56450.
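
For illustration (a sketch matching the explanation above, not the test itself): an explicit fence next to a seq_cst load has to stay, because the load alone does not provide a full barrier.

```cpp
#include <atomic>

int load_then_fence(std::atomic<int> &x) {
    // On a target without native atomics this load may lower to a call to
    // __atomic_load_4 ...
    int v = x.load(std::memory_order_seq_cst);
    // ... and this separate, explicit fence still needs its own barrier;
    // it cannot be folded into the load.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return v;
}
```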

nikic added inline comments. Jul 23 2022, 3:27 PM
llvm/test/CodeGen/ARM/atomic-op.ll
413

Uh sorry, I missed that there was an explicit IR fence instruction in this test. The preceding two functions don't have an IR fence, and also no longer emit dmb, so everything is good here.

nikic added a comment. Jul 24 2022, 3:48 AM

I think we're going to need an option to restore the previous behavior for Rust. The context here is that Rust targets can separately specify up to which size they support atomic load/store, and whether they support atomic CAS. Historically, the thumbv6m target allowed atomic load/store and disabled CAS, relying on the previous LLVM behavior of emitting atomic load/store as a memory barrier + simple load/store.

Of course, this makes Rust code using atomics incompatible with any C code using atomic CAS via libatomic (or I guess any other CAS implementation without OS support). There is an implicit assumption here that Rust code for this target will never get linked against C code using libatomic. I think this assumption was made by accident, but it's probably a fairly reasonable assumption to make for common use-cases, especially for baremetal targets.

Based on the feedback I received, removing support for atomic load/store for these targets would be a major breaking change for the embedded ecosystem. The other bit of context here is that Rust atomics are required to be lock-free -- I guess the alternative would be to go back on that and actually use libatomic, but that would probably come with its own complications, especially for this kind of target.

My thinking here is that we could add a target feature which basically says "I promise I won't use atomic CAS", in which case atomic load/store can be correctly lowered as before. Does that sound reasonable? It's worth noting that there is also interest in something like this for other targets, such as riscv without the A extension.

I see two possible approaches to restore the functionality you want.

One, you could add a target feature that says "I have 32-bit (or 64-bit) atomics". That would override the normal rule for setMaxAtomicSizeInBitsSupported. The target environment would be expected to provide whatever __sync_* functions are necessary. If you don't actually use any atomic operations that require CAS, you won't see any calls, so everything works even if the functions don't actually exist. (Or it's possible to write an operating system that implements lock-free CAS on Cortex-M0, the same way the Linux kernel implements atomics on old ARM cores.)

Two, you could relax the constraint that atomic load/store are required to be atomic with respect to cmpxchg. We can add a value to syncscope to represent this. This would allow mixing code that uses load-store atomics with code that requires "real" atomics.

That said, I'm not sure how you use load-store atomics in practice. I mean, I guess you can use Dekker's algorithm, or some kinds of lock-free queues, but that seems a bit exotic. Or do you use some sort of lock implemented by disabling interrupts?

nikic added a comment. Jul 25 2022, 6:32 AM

> I see two possible approaches to restore the functionality you want.

> One, you could add a target feature that says "I have 32-bit (or 64-bit) atomics". That would override the normal rule for setMaxAtomicSizeInBitsSupported. The target environment would be expected to provide whatever __sync_* functions are necessary. If you don't actually use any atomic operations that require CAS, you won't see any calls, so everything works even if the functions don't actually exist. (Or it's possible to write an operating system that implements lock-free CAS on Cortex-M0, the same way the Linux kernel implements atomics on old ARM cores.)

I think this is the solution we want. I've put up https://reviews.llvm.org/D130480 to implement this.

> Two, you could relax the constraint that atomic load/store are required to be atomic with respect to cmpxchg. We can add a value to syncscope to represent this. This would allow mixing code that uses load-store atomics with code that requires "real" atomics.

> That said, I'm not sure how you use load-store atomics in practice. I mean, I guess you can use Dekker's algorithm, or some kinds of lock-free queues, but that seems a bit exotic. Or do you use some sort of lock implemented by disabling interrupts?

Some use cases are mentioned starting from here: https://github.com/rust-lang/rust/issues/99668#issuecomment-1193417939. It seems to be mostly about synchronization with interrupt handlers, lock-free SPSC queues, and the ability to use atomics as safe mutable globals (a Rust-specific issue). And yes, disabling interrupts for critical sections seems to be common practice in this context as well.
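
For context, a typical load/store-only pattern from that discussion looks roughly like this (a sketch; the ISR and helper names are hypothetical). On a single-core MCU the handler cannot run concurrently with itself, so plain atomic loads and stores are enough and no CAS or read-modify-write is needed.

```cpp
#include <atomic>
#include <cstdint>

uint32_t read_adc();             // hypothetical device read
void process(uint32_t sample);   // hypothetical consumer

// Flag set by an interrupt handler and polled by the main loop.  Only
// atomic load and store are used, so a target that guarantees lock-free
// load/store (but nothing more) suffices.
std::atomic<bool> data_ready{false};
std::atomic<uint32_t> latest_sample{0};

extern "C" void ADC_IRQHandler() {   // hypothetical ISR name
    latest_sample.store(read_adc(), std::memory_order_relaxed);
    data_ready.store(true, std::memory_order_release);
}

void poll_once() {
    if (data_ready.load(std::memory_order_acquire)) {
        data_ready.store(false, std::memory_order_relaxed);
        process(latest_sample.load(std::memory_order_relaxed));
    }
}
```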