This is an archive of the discontinued LLVM Phabricator instance.

[ARM] Pretend atomics are always lock-free for small widths.
Needs Review · Public

Authored by efriedma on Nov 14 2022, 1:42 PM.

Details

Summary

Trying to accurately model what the hardware actually supports seems to lead to a lot of complaints, while nobody says it's actually helpful. So just pretend everything is lock-free, and let users deal with ensuring that the __sync_* routines are actually lock-free. If anyone complains, we can just say "gcc does the same thing".

Partially reverts D120026. Makes D130480 unnecessary.

Fixes https://github.com/llvm/llvm-project/issues/58603
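
For context, the practical effect on a core without native atomics (e.g. cortex-m0) is roughly the following; this is a sketch of the intended behaviour, and the function name is illustrative, not taken from the patch:

#include <stdatomic.h>

_Atomic int counter;

int bump(void) {
  /* With this change, clang reports the operation as lock-free even on
     cores without native atomic instructions, and lowers it to a
     __sync_* libcall (__sync_fetch_and_add_4 here); making that routine
     genuinely lock-free becomes the user's responsibility. */
  return atomic_fetch_add(&counter, 1);
}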

Event Timeline

efriedma created this revision. · Nov 14 2022, 1:42 PM
efriedma requested review of this revision. · Nov 14 2022, 1:42 PM
Herald added projects: Restricted Project, Restricted Project. · Nov 14 2022, 1:42 PM

Looking at GCC, it appears that (for cortex-m0 at least) loads and stores are generated inline, while more complex operations go to the atomic library calls (not the sync library calls). For example, given

int x, y;
int fn() {
  /* Seq_cst atomic load: gcc generates this inline. */
  return __atomic_load_n(&x, __ATOMIC_SEQ_CST);
}
int fn2() {
  /* Compare-exchange: gcc emits a call to __atomic_compare_exchange_4. */
  return __atomic_compare_exchange_n(&x, &y, 0, 0, 0, __ATOMIC_SEQ_CST);
}

and compiling with arm-none-eabi-gcc tmp.c -O1 -mcpu=cortex-m0, I get:

fn:
        ldr     r3, .L2
        dmb     ish
        ldr     r0, [r3]
        dmb     ish
        bx      lr

fn2:
        push    {lr}
        sub     sp, sp, #12
        ldr     r0, .L5
        adds    r1, r0, #4
        movs    r3, #5
        str     r3, [sp]
        movs    r3, #0
        movs    r2, #0
        bl      __atomic_compare_exchange_4
        add     sp, sp, #12
        pop     {pc}

So if we're doing this for compatibility with GCC, we should do the same.
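
For reference, the size-specific libcall generated above takes no weak flag (per the documented __atomic library interface), which is why the assembly passes only the pointer, expected, and desired values plus the two memory orders: the success order (0) in r3 and the failure order (5, i.e. SEQ_CST) on the stack. A sketch of the expected prototype, with illustrative parameter names:

/* 4-byte compare-exchange libcall; returns whether the exchange happened. */
_Bool __atomic_compare_exchange_4(void *ptr, void *expected,
                                  unsigned int desired,
                                  int success_order, int failure_order);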

So gcc has two different behaviors on ARM:

  • On Linux, gcc prefers __sync calls, and generates inline code for load/store.
  • On baremetal, gcc chooses which sort of atomic call to generate based on how the source is written: if the user writes __sync they get __sync calls, and if the user writes __atomic they get __atomic calls (see the sketch below). It still generates inline code for load/store, so it's assuming the __atomic implementation is lock-free.
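
For instance, with baremetal gcc (function names here are illustrative):

int v;
/* Emits a call to __sync_fetch_and_add_4. */
int via_sync(void) { return __sync_fetch_and_add(&v, 1); }
/* Emits a call to __atomic_fetch_add_4. */
int via_atomic(void) { return __atomic_fetch_add(&v, 1, __ATOMIC_SEQ_CST); }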

We'd have to hack clang IR generation to generate different IR for the two constructs. I'm not sure what the underlying logic is, or if it's worth trying to emulate.

Any further comment on this?