This is an archive of the discontinued LLVM Phabricator instance.

[compiler-rt] Implement ARM atomic operations for architectures without SMP support
Needs ReviewPublic

Authored by kpdev42 on Dec 20 2021, 10:41 PM.

Details

Summary

ARMv5 and older architectures don’t support SMP and do not have atomic instructions. Still they’re in use in IoT world, where one has to stick to libgcc.

~~~

OS Laboratory. Huawei RRI. Saint-Petersburg

Diff Detail

Event Timeline

kpdev42 created this revision.Dec 20 2021, 10:41 PM
kpdev42 requested review of this revision.Dec 20 2021, 10:41 PM
Herald added a project: Restricted Project. · View Herald TranscriptDec 20 2021, 10:41 PM
Herald added a subscriber: Restricted Project. · View Herald Transcript
mstorsjo accepted this revision.Jan 19 2022, 3:25 AM

I think this looks reasonable. I haven't tried it, but I guess we should proceed with it instead of holding it back, as there's nobody else reviewing it.

This revision is now accepted and ready to land.Jan 19 2022, 3:25 AM
joerg added a subscriber: joerg.Jan 19 2022, 5:38 AM

Correct me if I'm wrong, but I don't think this stubs are async signal safe nor will they work for preemptive multitasking systems?

Correct me if I'm wrong, but I don't think this stubs are async signal safe nor will they work for preemptive multitasking systems?

Those stubs are basically cas loops (https://en.wikipedia.org/wiki/Compare-and-swap) which are not much different from their SMP counterparts, except memory sync ops are not used. This should work normally in case of preemption (preempting thread will spend its quota in busy-wait). Signal can be a problem if it preempt a thread executing atomic op, but I wonder if SMP code handles this

This revision was landed with ongoing or failed builds.Feb 16 2022, 11:11 PM
This revision was automatically updated to reflect the committed changes.

I'm concerned providing these is going to cause issues. The provided implementation are not atomic. Blindly assuming that the user is compiling for a target that doesn't have pre-emptible threads seems like a bad idea.

How do you expect users to use these methods, anyway? clang shouldn't be generating calls to these methods. (It looks like it actually does in some cases on ARM, but that's not intended behavior.)

For users who want to pretend threads don't exist, we should provide a compiler flag. -mthread-model doesn't quite work right for this at the moment, but it would make sense to fix it, I think.

efriedma reopened this revision.Feb 17 2022, 2:21 AM

Reverted in 0389f2edf7

This revision is now accepted and ready to land.Feb 17 2022, 2:21 AM
efriedma requested changes to this revision.Feb 17 2022, 2:22 AM
This revision now requires changes to proceed.Feb 17 2022, 2:22 AM
lkail added a subscriber: lkail.Feb 17 2022, 2:24 AM

D120026 is merged now, which addresses the issue of the compiler generating __sync calls where it isn't supposed to.

Does anyone want to continue discussing what changes to compiler-rt would be appropriate? I didn't mean to completely shut down discussion with my comment.

Herald added a project: Restricted Project. · View Herald TranscriptMar 18 2022, 2:00 PM

D120026 is merged now, which addresses the issue of the compiler generating __sync calls where it isn't supposed to.

Does anyone want to continue discussing what changes to compiler-rt would be appropriate? I didn't mean to completely shut down discussion with my comment.

@efriedma

Imagine we have the following piece of code in the program:

volatile int G;
int foo() { return __sync_add_and_fetch(&G, 1); }

Now we want having this built and running on armv5 platform. At the moment the only option we have is to use libgcc. Unfortunately this have one big disadvantage: we're only limited to Linux, because call to __sync_add_and_fetch boils down to Linux kernel user helper. We want this to work on other platforms also, and here is what compiler-rt good for.

However sync ops operations in compiler-rt use memory barriers, which doesn't work on armv5: any attempt to use memory barrier on the latter will result in SIGILL. As armv5 doesn't support SMP (but still supports preemptive multitasking) it's possible in out opinion to implement sync ops as a compare and swap loop without memory barriers. What's your opinion on this?

On a target that doesn't support SMP, you don't need memory barriers, sure. (I think we'd want a CMake flag to explicitly assert that you're only going to run the code on chips without SMP.)

That doesn't really solve your issue, though. To implement atomic compare-and-swap or rmw operations, you need to ensure your code doesn't get interrupted in the middle. If your system supports multithreading, and a thread can be preempted at any time, you need one of the following:

  1. A natively atomic operation, like strex.
  2. A way to temporarily turn off interrupts, so the thread can't be preempted for a short time.
  3. Kernel-assisted operations, like the Linux kernel provides.
  4. A separate lock.

None of these options work on armv5 without some sort of kernel assistance. (Well, technically, you can implement a spinlock with swp, but that's very inefficient.)

At the moment, in case of compiler-rt, __sync_add_and_fetch boils down to
__sync_add_and_fetch_N, where N is the size of data being fetched (4 for int).
The implementation of __sync_fetch_and_add_N does approximately the following:

  1. Sets memory barrier
  2. Calls atomic load from memory location
  3. Modifies data
  4. Calls atomic store to memory location
  5. Checks that operation is consistent, if not goes to step 2.

IMO, performance-wise there is not much difference (if any) between this and
modifying data with acquiring spinlock.

No code in compiler-rt disables interrupts, so it can and will be interrupted in the middle,
by a different thread however I don't see any problem in this.

Now if we are on a platform which doesn't support SMP we can use ordinary memory operations instead
of atomic ones, can't we?

  1. Sets memory barrier
  2. Calls atomic load from memory location
  3. Modifies data
  4. Calls atomic store to memory location
  5. Checks that operation is consistent, if not goes to step 2.

Now if we are on a platform which doesn't support SMP we can use ordinary memory operations instead
of atomic ones, can't we?

Even on a non-SMP processor, there's a bit of magic in ldrex/strex: ldrex sets a hidden "lock" bit, and strex checks it. If there's a context switch between the load and the store, the switch will clear that bit. So when the code continues to execute, the store fails.

kpdev42 updated this revision to Diff 432478.May 27 2022, 12:18 AM
kpdev42 edited the summary of this revision. (Show Details)

Well, after some investigation it turned out that:

  1. ARMv5 has DMB instruction in the form of mcr p15, #0, <Rd>, c7, c10, #5
  2. There is SWP instruction (deprecated on ARMv6), which does atomic exchange of 32-bit values

I've reimplemented sync ops using these primitves, PTAL
Theoretically this should work on ARMv6 and higher, though I didn't check this

ldr+op+swp still isn't atomic. For each point in the code, please try the exercise of "what if my code is interrupted here"?

The only way to use swp to implement general atomic operations is to use a separate spinlock.