This is an archive of the discontinued LLVM Phabricator instance.

[COFF, ARM64] Add _InterlockedCompareExchangePointer_nf intrinsic
Closed, Public

Authored by mgrang on Oct 2 2018, 4:01 PM.

Event Timeline

mgrang created this revision. Oct 2 2018, 4:01 PM
rnk added inline comments. Oct 2 2018, 4:18 PM
lib/CodeGen/CGBuiltin.cpp
3003

Is the no fence version really equivalent to this seq_cst version here?

mgrang added inline comments. Oct 3 2018, 10:59 AM
lib/CodeGen/CGBuiltin.cpp
3003

I don't see _InterlockedCompareExchangePointer creating a fence, but it should, since it is the fenced version. So maybe that needs to be fixed (and possibly the other fenced functions).

_InterlockedCompareExchangePointer_nf should not create a fence, so the current implementation seems correct.

efriedma added inline comments.
lib/CodeGen/CGBuiltin.cpp
3003

"create a fence" is a little confusing because the AArch64 ldaxr/stlxr have builtin fences... but presumably the "nf" version should use ldxr/stxr instead. At the IR level, that corresponds to "monotonic".

rnk added inline comments. Oct 4 2018, 1:47 PM
lib/CodeGen/CGBuiltin.cpp
3003

Hm, if we're supposed to have fences, we're probably missing them everywhere. I thought the atomicrmw orderings were enough, but that shows what I know about the C++ memory model. =p

We don't generate a fence for this intrinsic when MSVC does:

unsigned char test_arm(long *base, long idx) {
  return _interlockedbittestandreset_acq(base, idx);
}

@efriedma, is MSVC over-emitting extra fences, or are we under-emitting? If this is their "bug", should we be bug-for-bug compatible with it so we get the same observable memory states?
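
For reference, the extra dmb amounts to a full barrier after the read-modify-write. A hand-written analogue (not the actual clang lowering of the intrinsic, and assuming a 32-bit long as on Windows) would look roughly like:

unsigned char test_arm_with_trailing_fence(long *base, long idx) {
  long mask = 1L << (idx & 31);
  long old = __atomic_fetch_and(base, ~mask, __ATOMIC_ACQUIRE); /* acquire RMW */
  __atomic_thread_fence(__ATOMIC_SEQ_CST); /* the trailing dmb ish MSVC emits */
  return (unsigned char)((old >> (idx & 31)) & 1);
}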

mgrang added a comment. Oct 4 2018, 2:01 PM

MSVC seems to generate ldaxr/stlxr (plus a trailing dmb) for _InterlockedCompareExchangePointer, and ldxr/stxr (with no dmb) for _InterlockedCompareExchangePointer_nf.

void *test_InterlockedCompareExchangePointer(void * volatile *Destination,
                                             void *Exchange, void *Comparand) {
  return _InterlockedCompareExchangePointer(Destination, Exchange, Comparand);
}

test_InterlockedCompareExchangePointer:
      58:       ff 83 00 d1     sub     sp, sp, #32
      5c:       e0 0b 00 f9     str     x0, [sp, #16]
      60:       e1 07 00 f9     str     x1, [sp, #8]
      64:       e2 03 00 f9     str     x2, [sp]
      68:       ec 03 40 f9     ldr     x12, [sp]
      6c:       ec 03 0c aa     mov     x12, x12
      70:       eb 07 40 f9     ldr     x11, [sp, #8]
      74:       eb 03 0b aa     mov     x11, x11
      78:       ea 0b 40 f9     ldr     x10, [sp, #16]
      7c:       ea 03 0a aa     mov     x10, x10

$LN3:
      80:       49 fd 5f c8     ldaxr   x9, [x10]
      84:       e9 03 09 aa     mov     x9, x9
      88:       3f 01 0c eb     cmp     x9, x12
      8c:       81 00 00 54     b.ne    #16 <$LN4>
      90:       4b fd 08 c8     stlxr   w8, x11, [x10]
      94:       1f 01 00 71     cmp     w8, #0
      98:       41 ff ff 54     b.ne    #-24 <$LN3>

$LN4:
      9c:       bf 3b 03 d5     dmb     ish
      a0:       e9 0f 00 f9     str     x9, [sp, #24]
      a4:       e0 0f 40 f9     ldr     x0, [sp, #24]
      a8:       ff 83 00 91     add     sp, sp, #32
      ac:       c0 03 5f d6     ret

void *test_InterlockedCompareExchangePointer_nf(void * volatile *Destination,
                                                void *Exchange, void *Comparand) {
  return _InterlockedCompareExchangePointer_nf(Destination, Exchange, Comparand);
}

test_InterlockedCompareExchangePointer_nf:
       0:       ff 83 00 d1     sub     sp, sp, #32
       4:       e0 0b 00 f9     str     x0, [sp, #16]
       8:       e1 07 00 f9     str     x1, [sp, #8]
       c:       e2 03 00 f9     str     x2, [sp]
      10:       ec 03 40 f9     ldr     x12, [sp]
      14:       ec 03 0c aa     mov     x12, x12
      18:       eb 07 40 f9     ldr     x11, [sp, #8]
      1c:       eb 03 0b aa     mov     x11, x11
      20:       ea 0b 40 f9     ldr     x10, [sp, #16]
      24:       ea 03 0a aa     mov     x10, x10

$LN3:
      28:       49 7d 5f c8     ldxr    x9, [x10]
      2c:       e9 03 09 aa     mov     x9, x9
      30:       3f 01 0c eb     cmp     x9, x12
      34:       81 00 00 54     b.ne    #16 <$LN4>
      38:       4b 7d 08 c8     stxr    w8, x11, [x10]
      3c:       1f 01 00 71     cmp     w8, #0
      40:       41 ff ff 54     b.ne    #-24 <$LN3>

$LN4:
      44:       e9 0f 00 f9     str     x9, [sp, #24]
      48:       e0 0f 40 f9     ldr     x0, [sp, #24]
      4c:       ff 83 00 91     add     sp, sp, #32
      50:       c0 03 5f d6     ret
      54:       00 00 00 00  <unknown>
mgrang added a comment. Oct 4 2018, 2:05 PM

AArch64 Pseudo Expansion does not seem to honor AtomicOrdering types (SequentiallyConsistent/Monotonic). It seems to always generate LDAXRX/STLXRX for 64-bit Cmp_Xchg.

How are you testing?

You might see different behavior at -O0.

efriedma added inline comments. Oct 4 2018, 3:27 PM
lib/CodeGen/CGBuiltin.cpp
3003

In terms of emitting fences, maybe we are actually missing a fence.

An ldaxr+stlxr pair isn't a complete fence, at least in theory; it stops loads from being hoisted past it, and stores from being sunk past it, but stores can be hoisted and loads can be sunk. That said, a stronger fence isn't necessary to model C++11 atomics; C++11 only requires sequential consistency with other sequentially consistent operations, plus an acquire/release barrier. And a program that isn't using C++11 relaxed atomics can't depend on a stronger barrier without triggering a data race.

A program written before C++11 atomics were generally available could theoretically depend on the barrier through a series of operations that cause a data race under the C++11 rules. Since there were no clear rules before C++11, programs did things like emulate relaxed atomics on top of volatile pointers; using such a construct, a program could depend on the implied barrier. For this reason, gcc emits a fence on AArch64 after __sync_* (but not __atomic_*). And I guess MSVC does something similar for _Interlocked*.

That said, LLVM has never emitted a dmb for __sync_* operations on AArch64, and it hasn't led to any practical problems as far as I know.
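
A contrived Dekker-style fragment (the names here are invented) illustrates the kind of pre-C++11 code being described: it is a data race under the C++11 rules, yet its mutual-exclusion argument only holds if the RMW acts as a full store-load barrier, which an acquire/release ldaxr/stlxr pair by itself does not provide.

volatile int flag0, flag1;   /* "relaxed atomics" emulated with volatile */

int thread0_try_enter(void) {
  __sync_fetch_and_or(&flag0, 1);  /* old code relied on this being a full barrier */
  return flag1 == 0;               /* must not be hoisted above the flag0 store */
}

/* thread1_try_enter is symmetric: set flag1, then read flag0. */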

mgrang updated this revision to Diff 168403. Oct 4 2018, 4:03 PM

> How are you testing?
>
> You might see different behavior at -O0.

Yes, I was testing at -O0. At -O2 I can see monotonic ordering affect the choice of LD/ST instructions used.
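
(One way to double-check, independent of the -O0 pseudo expansion, is to look at the ordering on the cmpxchg in the emitted IR, e.g. clang --target=aarch64-windows -S -emit-llvm on a small probe like the one below; with this change the _nf variant is expected to show a monotonic ordering, while the plain intrinsic stays seq_cst. The file and function names are invented, and the probe assumes an environment where <intrin.h> declares the intrinsic.)

/* probe.c */
#include <intrin.h>

void *probe_nf(void * volatile *d, void *e, void *c) {
  return _InterlockedCompareExchangePointer_nf(d, e, c);
}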

This revision is now accepted and ready to land. Oct 5 2018, 12:03 PM
This revision was automatically updated to reflect the committed changes.
efriedma added inline comments. Oct 5 2018, 12:52 PM
lib/CodeGen/CGBuiltin.cpp
3003

> it stops loads from being hoisted past it, and stores from being sunk past it, but stores can be hoisted and loads can be sunk.

Looking over it again, this bit is a really bad description of what actually happens. Really there are two issues: one, an operation from after an ldaxr/stlxr pair could end up between the ldaxr and stlxr, and two, a cmpxchg doesn't always execute the stlxr.
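
To make those two issues concrete, here is a sketch of the loop a seq_cst 64-bit cmpxchg turns into, written with clang's ACLE exclusive-access builtins (__builtin_arm_ldaex/__builtin_arm_stlex) for an AArch64 target; it is only an illustration of the instruction pattern, not the compiler's actual expansion, and the names are invented.

unsigned long long cmpxchg_seq_cst_sketch(unsigned long long *Destination,
                                          unsigned long long Exchange,
                                          unsigned long long Comparand) {
  unsigned long long Old;
  do {
    Old = __builtin_arm_ldaex(Destination);             /* ldaxr */
    if (Old != Comparand)
      break;  /* issue two: the failure path never reaches the stlxr, so no
                 store-release is performed at all */
  } while (__builtin_arm_stlex(Exchange, Destination)); /* stlxr; retry if it fails */
  /* issue one: an unrelated later access can still be scheduled between the
     ldaxr and the stlxr, so the pair is not a full two-sided fence */
  return Old;
}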