This is an archive of the discontinued LLVM Phabricator instance.

[atomics] Fix runtime calls for misaligned atomics
ClosedPublic

Authored by t.p.northover on Apr 5 2018, 6:36 AM.

Details

Summary

The __atomic_whatever functions in compiler-rt were only looking at the size of their argument when deciding whether the implementation should be lock-free. This is incorrect:

  • On x86, simple loads and stores are not atomic if the address is misaligned. The functions could use cmpxchg to implement loads & stores instead, but I believe this would be incompatible with GCC's ABI (and it is an ABI choice, because you obviously can't mix the two implementations).
  • On other platforms (ARM, at least) misaligned atomic accesses will simply fault, in addition to the problem above.

So this patch falls back to locks for the misaligned case. It also adds the single-byte case to the generic functions since even though Clang won't make calls to them, they should probably work properly if anyone does.
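
The shape of the fix, as a rough sketch (illustrative macro names, not the exact compiler-rt diff): the lock-free fast path now requires both a supported size and a suitably aligned pointer.

#include <stdint.h>

/* Sketch only: the real change is in compiler-rt/lib/builtins/atomic.c.
   Previously the decision was just __c11_atomic_is_lock_free(N); now the
   pointer must also be naturally aligned for the access size, otherwise
   the call falls through to the lock-based implementation. */
#define IS_LOCK_FREE_4(ptr)                                                    \
  (__c11_atomic_is_lock_free(4) && ((uintptr_t)(ptr) % 4) == 0)
#define IS_LOCK_FREE_8(ptr)                                                    \
  (__c11_atomic_is_lock_free(8) && ((uintptr_t)(ptr) % 8) == 0)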

Diff Detail

Event Timeline

t.p.northover created this revision.Apr 5 2018, 6:36 AM
Herald added subscribers: Restricted Project, kristof.beyls, mcrosier.Apr 5 2018, 6:36 AM
dvyukov added a subscriber: dvyukov.Apr 5 2018, 6:55 AM

How can atomic objects end up being misaligned? Isn't it UB already?

Atomics accessed via C11 _Atomic and C++11 std::atomic will be suitably aligned, but there's a reasonable amount of legacy code that uses the GCC builtins on non-atomic types (of unknown alignment) and this is what Clang uses to implement those accesses when they come up. Also, on the LLVM side there are even fewer restrictions and load atomic i32, i32* %ptr monotonic, align 1 is perfectly valid IR that gets lowered to these calls.

In parallel I'm trying to add a performance warning for when Clang hits this issue and is forced to generate libcalls.

Atomics accessed via C11 _Atomic and C++11 std::atomic will be suitably aligned, but there's a reasonable amount of legacy code that uses the GCC builtins on non-atomic types (of unknown alignment) and this is what Clang uses to implement those accesses when they come up.

This is illegal, right?
Even non-atomic accesses can fail and be miscompiled for unaligned variables, no?
For example, I would assume that "(uintptr_t)ptr % 4 == 0" where ptr comes from an unmarked int will be evaluated to true at compile time.

Also, on the LLVM side there are even fewer restrictions and load atomic i32, i32* %ptr monotonic, align 1 is perfectly valid IR that gets lowered to these calls.

I've just checked the LangRef and realised this is false. LLVM does make such loads & stores UB, though I still think the Clang use is valid (if unfortunate).

Even non-atomic accesses can fail and be miscompiled for unaligned variables, no?

For example, I would assume that "(uintptr_t)ptr % 4 == 0" where ptr comes from an unmarked int will be evaluated to true at compile time.

For an unmarked int that's true, but there are ways to get integer types with lowered alignment requirements, mostly revolving around applying some kind of packed attribute or pragma to either a struct or the type directly.
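
For instance, something like the following (a hypothetical illustration of the kind of legacy code in question) hands Clang a 1-byte-aligned int and forces it down the libcall path:

#include <stdint.h>

struct __attribute__((packed)) wire_header {
  uint8_t tag;
  uint32_t counter; /* only 1-byte aligned because the struct is packed */
};

uint32_t read_counter(struct wire_header *h) {
  /* A GCC-style builtin on a plain (non-_Atomic) member with reduced
     alignment; Clang can't use an ordinary inline atomic access here and
     ends up emitting an __atomic_* libcall instead. */
  return __atomic_load_n(&h->counter, __ATOMIC_SEQ_CST);
}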

I've just checked the LangRef and realised this is false. LLVM does make such loads & stores UB, though I still think the Clang use is valid (if unfortunate).

But won't clang lower these accesses to exactly the IR that LLVM declares as UB?

But won't clang lower these accesses to exactly the IR that LLVM declares as UB?

No, it explicitly checks alignment to decide whether to make a libcall itself.

Also, the LLVM situation is even more nuanced: after r320243 I think the LangRef is out of date and misaligned atomic operations can't really be called UB; fortunately, LLVM also seems to do the right thing with them.

No, it explicitly checks alignment to decide whether to make a libcall itself.

Ack. Thanks for explaining.

Yes, the situation around atomic instructions in IR is a bit of a mess. Originally, the backend was required to reject instructions which wouldn't be lock-free, and all the documentation was written with that expectation. But that was later changed to generate calls to __atomic_* instead, and I guess the documentation was never completely updated.

clang might also have a bug here? I think I've seen it generate calls to __atomic_load_8 with a misaligned operand.

compiler-rt/lib/builtins/atomic.c
168–169

We also need to fix this FIXME, for correctness... if we have a 16-byte atomic store implementation, we need to use it.

t.p.northover added inline comments.Apr 6 2018, 9:25 AM
compiler-rt/lib/builtins/atomic.c
168–169

Any chance I could skip that for now?

It's an order of magnitude harder than fixing the misalignment problems since LLVM already generates a mixture of libcalls and cmpxchg based on the CPU for x86, which forces this to be a runtime CPUID check.

efriedma added inline comments.Apr 6 2018, 10:59 AM
compiler-rt/lib/builtins/atomic.c
168–169

Can you at least fix it for targets where 16-byte atomics are always lock-free, like aarch64? (Of course I don't expect you to implement the x86 cpuid bits.)

t.p.northover added inline comments.Apr 6 2018, 11:15 AM
compiler-rt/lib/builtins/atomic.c
168–169

Sounds like a reasonable compromise. This check is easy enough to get right (I can use defined(__SIZEOF_INT128__) as a proxy for whether it's supported).

I don't suppose you have any ideas about the IS_LOCK_FREE_16 check though? I had a bit of a think earlier on and it's harder than it looks. The best I've come up with so far is:

#define IS_LOCK_FREE_16(ptr) (__builtin_constant_p(__c11_atomic_is_lock_free(16)) && __c11_atomic_is_lock_free(16) && (uintptr_t)ptr % 16 == 0)

The constant check is there because otherwise we get an actual (and unresolved) call to __atomic_is_lock_free(16, 0) when Clang doesn't know the answer -- and on x86 this would, of course, be CPUID-based if implemented, so I can't in good conscience make Clang decide it's 0.

__atomic_always_lock_free(16,0)?

Updated so that 16-byte atomics work when the instruction is always available. Good idea Eli.
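
Presumably the check ends up looking something like this (a sketch, not the verbatim diff):

/* __atomic_always_lock_free folds to a compile-time constant, so on targets
   where Clang knows 16-byte atomics are always lock-free this enables the
   fast path, and elsewhere it is simply 0 without creating an unresolved
   runtime call to __atomic_is_lock_free. */
#define IS_LOCK_FREE_16(ptr)                                                   \
  (__atomic_always_lock_free(16, 0) && ((uintptr_t)(ptr) % 16) == 0)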

efriedma accepted this revision.Apr 9 2018, 12:15 PM

LGTM, but maybe wait a couple days to see if @compnerd wants to comment.

This revision is now accepted and ready to land.Apr 9 2018, 12:15 PM

Sorry to update the patch after you've reviewed it, but I'm afraid as a result of https://llvm.org/PR34347 it's turned out that misaligned x86 atomics actually need the lock-free implementation, which makes this even messier than I thought.

It adds 3 problems:

  1. As far as I know there's no way Clang communicates this; in fact, I don't think it's even aware of it. So the updated patch has bare x86_64 and i386 checks.
  2. I could find no way to make Clang inline a misaligned atomic operation, and even if it could, the LLVM IR would be dubious (the LangRef is sprinkled with warnings about such things being undefined, and they get converted back to libcalls anyway).
  3. x86 movs aren't atomic when misaligned, so load & store need a separate path for that case.

Because of the second problem, misaligned pointers in the new patch go down an aligned codepath, with all the technical UB implications you might fear. It works, though. As far as I can tell this fixes everything except 128-bit x86 atomics in practice.

The Linux libatomic __atomic_is_lock_free returns false for unaligned pointers, even on x86. clang must generate code which is compatible with that, so it *cannot* inline misaligned atomic operations. Given that clang can't inline misaligned atomic operations anyway, I don't see any compelling reason for compiler-rt to try to lower misaligned atomics using lock-free operations.

Not that's it's really relevant, but for the testcase in https://bugzilla.redhat.com/show_bug.cgi?id=1565766#c4 , clang should assume the pointer is aligned: it's undefined behavior to produce an int* which isn't 4-byte aligned.

The Linux libatomic __atomic_is_lock_free returns false for unaligned pointers, even on x86. clang must generate code which is compatible with that, so it *cannot* inline misaligned atomic operations.

I agree we need to be compatible, but I don't see that:

#include <stdlib.h>
#include <stdio.h>

int main() {
  char *mem = malloc(16);

  // Query lock-freedom for power-of-two sizes at a deliberately misaligned
  // address (mem + 1).
  for (int i = 1; i <= 64; i <<= 1)
    printf("__atomic_is_lock_free(%d, %p) = %d\n", i, (void *)(mem + 1),
           __atomic_is_lock_free(i, mem + 1));
  free(mem);
}

tells me that everything <= 4 is lock-free on x86_64 Linux (I have no idea why 4). GCC also does inline them (and I see no special misaligned libatomic implementation except for __atomic_load and __atomic_store which take a lock).

Basically, I think GCC is in as bad shape as we are but in practice lock-free implementations are what gets emitted. At least we can change our implementation in the future since we do emit a libcall for misaligned operations.

I don't see any compelling reason for compiler-rt to try to lower misaligned atomics using lock-free operations.

My best argument is still that I think it's the maximally compatible way to proceed. It is ugly though.

efriedma added inline comments.May 16 2018, 3:50 PM
compiler-rt/lib/builtins/atomic.c
184

Isn't the return value of __c11_atomic_compare_exchange_weak the success boolean?

Even with that fixed, though, this is UB because you're violating the alignment rules for _Atomic. It might appear to emit the right code for now, but it's a ticking time bomb because the compiler could optimize the "cmpxchg" to a "mov" (since you're not actually modifying the memory on success).

The "right" way to do this is to use __atomic_load on a pointer to an unligned type. Granted, clang currently doesn't lower that to the sequence you want (instead it generates a libcall to __atomic_load_4).

efriedma added inline comments.May 16 2018, 3:58 PM
compiler-rt/lib/builtins/atomic.c
184

(Or, of course, you could implement this with inline assembly, if you can't get the compiler to emit the sequence you want.)

Sorry I take so long replying to this each time. I really need to plan my time better.

compiler-rt/lib/builtins/atomic.c
184

Isn't the return value of __c11_atomic_compare_exchange_weak the success boolean?

Oops, yes. I'll fix that.

Even with that fixed, though, this is UB because you're violating the alignment rules for _Atomic.

Oh yes, it's definitely really dodgy. As part of the implementation I think compiler-rt gets some kind of latitude to know about Clang's implementation details there though. It's impossible to implement std::vector without UB for example, but libc++ keeps chugging along anyway.

It might appear to emit the right code for now, but it's a ticking time bomb because the compiler could optimize the "cmpxchg" to a "mov" (since you're not actually modifying the memory on success).

I don't think it could (ignoring the whole UB => nasal demons thing). The mov would not do an atomic load, which would render the return value potentially invalid. A cmpxchg still needs to return a valid previous value even if memory is not modified.

It could potentially modify it to an atomic load and then back into a call to __atomic_load_4 though (assuming it inserted the correct barriers, which might not be possible for a cmpxchg release).

The "right" way to do this is to use __atomic_load on a pointer to an unligned type. Granted, clang currently doesn't lower that to the sequence you want (instead it generates a libcall to __atomic_load_4).

Yes. And even if there were a way to get Clang to generate the desired IR, LLVM doesn't do the right thing with it either (it calls __atomic_load).

Inline assembly is an option, but even for just x86 it's significantly uglier because of the whole amd64 vs x86, cmpxchg16b thing.

Fixing cmpxchg implementation of load to return correct value.
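
For reference, a minimal sketch of the corrected shape of that load (assuming, as discussed above, that it's implemented via compare-exchange; this is not the verbatim patch):

#include <stdint.h>

static uint32_t load_4_via_cmpxchg(_Atomic(uint32_t) *ptr, int model) {
  uint32_t expected = 0;
  /* The return value of __c11_atomic_compare_exchange_weak is the success
     boolean. The loaded value comes back through `expected`: on failure it
     is overwritten with the value observed in memory, and on success it
     already equals that value (0). The memory orders are simplified here;
     a real implementation must avoid an invalid failure order such as
     release. */
  __c11_atomic_compare_exchange_weak(ptr, &expected, expected, model, model);
  return expected;
}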

Did a bit more research into the GNU libatomic behavior. As far as I can tell, my initial assessment was correct: libatomic never uses unaligned atomic operations. However, it uses a trick which makes this a little complicated to test: it promotes small atomic operations to pointer size before performing the alignment check. So, for example, libatomic supports lock-free operations with size=3, depending on the input pointer. (On x86, it still only promotes up to pointer size, even though larger lock-free atomics are available; not sure why.)
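
Roughly the kind of check being described, as an illustration of the idea (this is not libatomic's actual code):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* An access of `size` bytes can be promoted to a single word-sized lock-free
   operation if it sits entirely inside one naturally aligned machine word:
   e.g. a size=3 access at offset 0 or 1 within a word qualifies, while one
   that straddles a word boundary does not. */
static bool fits_in_one_word(const void *ptr, size_t size) {
  uintptr_t addr = (uintptr_t)ptr;
  uintptr_t word_start = addr & ~(uintptr_t)(sizeof(void *) - 1);
  return addr + size <= word_start + sizeof(void *);
}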

compiler-rt/lib/builtins/atomic.c
184

The mov would not do an atomic load, which would render the return value potentially invalid. A cmpxchg still needs to return a valid previous value even if memory is not modified.

The normal lowering for an atomic load from an aligned pointer on x86 is "mov"; not sure what you're getting at here.

even for just x86 it's significantly uglier because of the whole amd64 vs x86, cmpxchg16b thing.

x86 doesn't support unaligned 16-byte atomic operations.

Herald added projects: Restricted Project, Restricted Project.Jan 12 2020, 4:41 PM
Herald added a subscriber: dexonsmith.
lkail added a subscriber: lkail.Feb 8 2022, 6:22 PM
Herald added a project: Restricted Project.Sun, Jan 21, 4:30 PM
Herald added a subscriber: Enna1.