lib/CodeGen/CGBuiltin.cpp:3003

Is the no fence version really equivalent to this seq_cst version here?
lib/CodeGen/CGBuiltin.cpp:3003

I don't see InterlockedCompareExchangePointer creating a fence, but it should, since it is the fence version. So maybe that needs to be fixed (and possibly the other fence functions). InterlockedCompareExchangePointer_nf should not create a fence, so the current implementation seems correct.
lib/CodeGen/CGBuiltin.cpp:3003

"create a fence" is a little confusing because the AArch64 ldaxr/stlxr have built-in fences... but presumably the "nf" version should use ldxr/stxr instead. At the IR level, that corresponds to "monotonic" ordering.
lib/CodeGen/CGBuiltin.cpp:3003

Hm, if we're supposed to have fences, we're probably missing them everywhere. I thought the atomicrmw orderings were enough, but that shows what I know about the C++ memory model. =p

We don't generate a fence for this intrinsic when MSVC does:

```
unsigned char test_arm(long *base, long idx) {
  return _interlockedbittestandreset_acq(base, idx);
}
```

@efriedma, is MSVC over-emitting extra fences, or are we under-emitting? If this is their "bug", should we be bug-for-bug compatible with it so we get the same observable memory states?
MSVC seems to generate ldaxr/stlxr (as well as dmb) for InterlockedCompareExchangePointer and ldxr/stxr for InterlockedCompareExchangePointer_nf.
```
void *test_InterlockedCompareExchangePointer(void * volatile *Destination,
                                             void *Exchange, void *Comparand) {
  return _InterlockedCompareExchangePointer(Destination, Exchange, Comparand);
}
```

```
test_InterlockedCompareExchangePointer:
      58: ff 83 00 d1   sub   sp, sp, #32
      5c: e0 0b 00 f9   str   x0, [sp, #16]
      60: e1 07 00 f9   str   x1, [sp, #8]
      64: e2 03 00 f9   str   x2, [sp]
      68: ec 03 40 f9   ldr   x12, [sp]
      6c: ec 03 0c aa   mov   x12, x12
      70: eb 07 40 f9   ldr   x11, [sp, #8]
      74: eb 03 0b aa   mov   x11, x11
      78: ea 0b 40 f9   ldr   x10, [sp, #16]
      7c: ea 03 0a aa   mov   x10, x10
$LN3:
      80: 49 fd 5f c8   ldaxr x9, [x10]
      84: e9 03 09 aa   mov   x9, x9
      88: 3f 01 0c eb   cmp   x9, x12
      8c: 81 00 00 54   b.ne  #16 <$LN4>
      90: 4b fd 08 c8   stlxr w8, x11, [x10]
      94: 1f 01 00 71   cmp   w8, #0
      98: 41 ff ff 54   b.ne  #-24 <$LN3>
$LN4:
      9c: bf 3b 03 d5   dmb   ish
      a0: e9 0f 00 f9   str   x9, [sp, #24]
      a4: e0 0f 40 f9   ldr   x0, [sp, #24]
      a8: ff 83 00 91   add   sp, sp, #32
      ac: c0 03 5f d6   ret
```
```
void *test_InterlockedCompareExchangePointer_nf(void * volatile *Destination,
                                                void *Exchange, void *Comparand) {
  return _InterlockedCompareExchangePointer_nf(Destination, Exchange, Comparand);
}
```

```
test_InterlockedCompareExchangePointer_nf:
       0: ff 83 00 d1   sub   sp, sp, #32
       4: e0 0b 00 f9   str   x0, [sp, #16]
       8: e1 07 00 f9   str   x1, [sp, #8]
       c: e2 03 00 f9   str   x2, [sp]
      10: ec 03 40 f9   ldr   x12, [sp]
      14: ec 03 0c aa   mov   x12, x12
      18: eb 07 40 f9   ldr   x11, [sp, #8]
      1c: eb 03 0b aa   mov   x11, x11
      20: ea 0b 40 f9   ldr   x10, [sp, #16]
      24: ea 03 0a aa   mov   x10, x10
$LN3:
      28: 49 7d 5f c8   ldxr  x9, [x10]
      2c: e9 03 09 aa   mov   x9, x9
      30: 3f 01 0c eb   cmp   x9, x12
      34: 81 00 00 54   b.ne  #16 <$LN4>
      38: 4b 7d 08 c8   stxr  w8, x11, [x10]
      3c: 1f 01 00 71   cmp   w8, #0
      40: 41 ff ff 54   b.ne  #-24 <$LN3>
$LN4:
      44: e9 0f 00 f9   str   x9, [sp, #24]
      48: e0 0f 40 f9   ldr   x0, [sp, #24]
      4c: ff 83 00 91   add   sp, sp, #32
      50: c0 03 5f d6   ret
      54: 00 00 00 00   <unknown>
```
AArch64 pseudo expansion does not seem to honor the AtomicOrdering types (SequentiallyConsistent/Monotonic). It seems to always generate LDAXRX/STLXRX for a 64-bit Cmp_Xchg.
lib/CodeGen/CGBuiltin.cpp:3003

In terms of emitting fences, maybe we are actually missing a fence. An ldaxr+stlxr pair isn't a complete fence, at least in theory; it stops loads from being hoisted past it, and stores from being sunk past it, but stores can be hoisted and loads can be sunk.

That said, a stronger fence isn't necessary to model C++11 atomics; C++11 only requires sequential consistency with other sequentially consistent operations, plus an acquire/release barrier. And a program that isn't using C++11 relaxed atomics can't depend on a stronger barrier without triggering a data race.

A program written before C++11 atomics were generally available could theoretically depend on the barrier through a series of operations that cause a data race under the C++11 rules. Since there were no clear rules before C++11, programs did things like emulate relaxed atomics on top of volatile pointers; using such a construct, a program could depend on the implied barrier. For this reason, gcc emits a fence on AArch64 after __sync_* (but not __atomic_*). And I guess MSVC does something similar for _Interlocked*.

That said, LLVM has never emitted a dmb for __sync_* operations on AArch64, and it hasn't led to any practical problems as far as I know.
Yes, I was testing at -O0. At -O2 I can see monotonic ordering affect the choice of LD/ST insts used.
lib/CodeGen/CGBuiltin.cpp:3003

Looking over it again, this bit is a really bad description of what actually happens. Really there are two issues: one, an operation from after an ldaxr/stlxr pair could end up between the ldaxr and stlxr, and two, a cmpxchg doesn't always execute the stlxr.