This is an archive of the discontinued LLVM Phabricator instance.

[compiler-rt][builtins] Implement some fetch-and-x operations for Cortex-M
AbandonedPublic

Authored by aykevl on Apr 23 2019, 6:57 PM.

Details

Summary

This patch adds support for the following builtins on the Cortex-M0 and up:

__sync_fetch_and_add_4
__sync_fetch_and_add_8
__sync_fetch_and_and_4
__sync_fetch_and_and_8
__sync_fetch_and_nand_4
__sync_fetch_and_nand_8
__sync_fetch_and_or_4
__sync_fetch_and_or_8
__sync_fetch_and_sub_4
__sync_fetch_and_sub_8
__sync_fetch_and_xor_4
__sync_fetch_and_xor_8

The assembly for Cortex-M3 and up could be slightly more efficient, but I've chosen to keep it simple so that all ARM processors can deal with it.

Diff Detail

Event Timeline

aykevl created this revision.Apr 23 2019, 6:57 PM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptApr 23 2019, 6:57 PM
Herald added subscribers: Restricted Project, llvm-commits, kristof.beyls and 2 others. · View Herald Transcript
aykevl updated this revision to Diff 196370.Apr 23 2019, 7:00 PM
aykevl edited the summary of this revision. (Show Details)
aykevl set the repository for this revision to rCRT Compiler Runtime.

If someone could suggest a good way to test this, that would be great. At the moment it has been tested by compiling for armv6m, armv7m, and armv7em and confirming that the add operations appear to work.

Is there any reason to avoid 32-bit ldrex/strex on M-class processors other than Cortex-M0?

I haven't really done any systems programming on ARM, so I'm not sure what the consequences of the sequence you use for enabling/disabling interrupts might be. Is it reasonable for a library to turn off interrupts for a few cycles? Is it possible that the value of primask might change after the mrs, but before the cpsid instruction?

Is there any reason to avoid 32-bit ldrex/strex on M-class processors other than Cortex-M0?

No there isn't any reason to avoid ldrex/strex on M-class. It is the recommended way of implementing these functions on all architectures that support them.

I haven't really done any systems programming on ARM, so I'm not sure what the consequences of the sequence you use for enabling/disabling interrupts might be. Is it reasonable for a library to turn off interrupts for a few cycles?

It depends on your application. If the application requires hard real time low latency guarantees for interrupt response time but also uses concurrency elsewhere for non-real time parts then disabling interrupts for these functions could be a serious problem. You typically wouldn't use a cortex-m0 for such an application but a cortex-m3/m4 is certainly possible. Personally I think that we should be using ldrex/strex on all architectures that support them.

Is it possible that the value of primask might change after the mrs, but before the cpsid instruction?

In theory

  • interrupts could be enabled at the point of the mrs instruction
  • an interrupt could occur in between the mrs and the cpsid
  • that interrupt could disable further interrupts and not re-enable them (unlikely) but possible.
  • the cpsid will disable the interrupts (no-op as they are already disabled)
  • the msr will re-enable interrupts

I'm struggling to think of a use case of an asynchronous interrupt that would disable further interrupts and rely on some other (non interrupt) part of the code to re-enable them.

One other point, are these files compiled for cortex-m0? My reading of builtins/CMakeLists.txt with:

set(thumb1_SOURCES
  arm/divsi3.S
  arm/udivsi3.S
  arm/comparesf2.S
  arm/addsf3.S
  ${GENERIC_SOURCES}
)
...
set(thumb1_SOURCES
  ${thumb1_SOURCES}
  ${arm_EABI_SOURCES}
)
...
set(armv6m_SOURCES ${thumb1_SOURCES})

The sync_fetch_and_add_4.S etc. are only defined for arm_Sources which are v7 and above. If I'm right you'd need to alter the build-system for v6m for these files to be included in the library? Am I missing something?

aykevl planned changes to this revision.EditedApr 30 2019, 11:40 AM

No there isn't any reason to avoid ldrex/strex on M-class. It is the recommended way of implementing these functions on all architectures that support them.

Ok, I'll update the code to use ldrex/strex on armv7m and up for 32-bit operations. I don't see a way of using ldrex/strex for 64-bit operations on any of the Cortex-M devices, though (unless the more capable ones have 64-bit ldrex/strex instructions).

I haven't really done any systems programming on ARM, so I'm not sure what the consequences of the sequence you use for enabling/disabling interrupts might be. Is it reasonable for a library to turn off interrupts for a few cycles?

I have been thinking about this as well but if the only possible implementation is by disabling interrupts, I think we should do that. If anyone knows a better way on Cortex-M0 please tell.

I'm struggling to think of a use case of an asynchronous interrupt that would disable further interrupts and rely on some other (non interrupt) part of the code to re-enable them.

That was my thinking as well. Although it is possible, I think it would be bad practice. Many RTOSes have macros for disabling/enabling interrupts using the same pattern.

One other point, are these files compiled for cortex-m0?

Good point, will take a look. I haven't used the CMake-based built system for my application so I didn't need it but will make sure it works with it.

aykevl updated this revision to Diff 197554.May 1 2019, 8:08 AM
  • use the ldrex/strex implementation on Cortex-M3 and up for SYNC_OP_4
  • fixed a bug in SYNC_OP_4 for the sub instruction (a bug remains in SYNC_OP_8, will fix soon)
aykevl updated this revision to Diff 197559.May 1 2019, 8:41 AM
  • fixed the bug in SYNC_OP_8

Thanks for the update.

If someone could suggest a good way to test this, that would be great. At the moment it has been tested by compiling for armv6m, armv7m, and armv7em and confirming that the add operations appear to work.

While it will be difficult to test for threading problems, it would be possible to test these functions in a single threaded context by adding test cases in compiler-rt/test/builtins/Unit. We'd need to be careful that these only ran when the functions are supported though. Getting these tests to run for native builds would be simple, cortex-m would be more difficult as we'd have to cross compile and run in an emulator.

lib/builtins/arm/sync-ops.h
14

Probably want to use https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html as this will track the latest release of GCC.

34

Strictly speaking I think we need more guards here, although this would have been a problem prior to this patch.

  • ldrex and strex were introduced in ARM state in armv6 (that is processors prior to A, R, M profile split and not to be confused with armv6-m!)
  • ldrex and strex were introduced in Thumb state in armv6t2 (arm-1156t2-s).
  • Prior to that arm cpus had a SWP instruction that was used to implement an atomic swap, however has been removed in v8.
  • disabling interrupts with cpsid is specific to M class.

I think fixing that is probably best done in another patch as I suspect very few people are trying to build compiler-rt for these older platforms.

52

As an alternative, which I've seen used on some other toolchains, is to just not define the _8 variants on Cortex-M. These are not widely used and it might be better for a user to get an undefined symbol error so that they could choose to implement their program differently or implement their own _8 variant?

One of the main advantages of ldrex, strex is that the exclusive monitor that they set on the address can be read by external peripherals, disabling interrupts may not help in this case so it may be best to not use an interrupt disabled version here.

aykevl updated this revision to Diff 197591.May 1 2019, 10:43 AM
  • updated the URL to the GCC docs
aykevl marked 3 inline comments as done.May 1 2019, 11:35 AM
aykevl added inline comments.
lib/builtins/arm/sync-ops.h
52

Perhaps you're right. Especially in low-latency applications, disabling interrupts in library calls that were inserted by the compiler is probably unexpected. I've checked with arm-none-eabi-gcc version 5.4.1 (as packaged in Debian stretch), and it is not included in the provided libc.
So maybe compiler-rt should not implement this? This would mean:

  • armv6m provides no atomic operations at all, as the required instructions are missing. They should be implemented by the application in some appropriate way. I think this should be documented.
  • armv7m and up provide atomic operations for 32-bit values only.

Or perhaps better: we should teach LLVM to emit the proper instruction sequences inline (for armv7m and up) instead of calling a library function.

Some related reading: https://gcc.gnu.org/wiki/Atomic/GCCMM/LIbrary

One of the main advantages of ldrex, strex is that the exclusive monitor that they set on the address can be read by external peripherals, disabling interrupts may not help in this case so it may be best to not use an interrupt disabled version here.

The way I read the docs, the exclusive monitor bit can only be reset during this sequence if an exception occurs. In other words, a peripheral might update the memory location making the operation non-atomic (but I don't see a way to make it atomic). Thus it sounds to me like (for example) __sync_fetch_and_add_4 can only be used on normal memory, not memory in the peripheral address space. Interrupts disabled or not won't make a difference.
Source:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0552a/BABHCIHB.html

aykevl updated this revision to Diff 197605.May 1 2019, 11:39 AM
  • now _really_ update the URL

There are really two intended uses for the __sync_* builtins on ARM.

  1. Allow pre-Thumb2 ARMv6 code in Thumb mode to use atomics which are natively supported, but can't be emitted inline (because the instructions only exist in ARM mode).
  2. On ARM Linux, there's a special kernel-assisted sequence that can be used to implement these builtins for all ARM targets, even ones which don't have native atomics. (compiler-rt currently doesn't support this, but libgcc does).

Neither of these is relevant for M-class processors.

In other cases, the compiler should emit an inline sequence, or call one of the __atomic_* APIs. See http://llvm.org/docs/Atomics.html#libcalls-sync . If that isn't happening, it's a bug in clang/LLVM.

The open question here, is whether it's appropriate to turn off interrupts in compiler-rt, if we're in a baremetal environment. I would lean towards no... if someone really wants to use an implementation that turns off interrupts, they can implement __atomic_* appropriately.

disabling interrupts with cpsid is specific to M class

The instruction exists on A-class targets. It's just that on most A-class targets, we run in userspace, and userspace isn't allowed to disable interrupts, for obvious reasons. (Also, on A-class targets, we can't generally assume there's only one CPU core.)

The instruction exists on A-class targets. It's just that on most A-class targets, we run in userspace, and userspace isn't allowed to disable interrupts, for obvious reasons.

M-profile also has two privilege levels, and the CPS instruction is not usable in unprivileged mode, so disabling interrupts won't always work (and the CPS instruction is a no-op in unprivileged mode, so we would silently get non-atomic accesses).

aykevl abandoned this revision.Jun 9 2019, 4:01 AM

Ok, based on the above comments I believe that this is a bug in LLVM and not a missing feature in compiler-rt. LLVM should generate __atomic_* calls instead of __sync_ calls and it shouldn't generate those for Cortex-M3 and up when doing 32-bit atomics because it should be able to emit such instructions inline.

Unfortunately, I know very little about the backend to fix this myself.