Fixes #48236
Preferably this should also include the implementation for atomic RMW/CAS instructions to prove that this lowering is legal. If native or at least kernel-supported CAS is not available, then atomic load/store needs to use libatomic (possibly subtarget dependent).
It would also be good to include relevant quotes from the ISA manual -- atomicity of load/store is usually a given, but do they also guarantee a seq_cst ordering without a memory barrier?
(Disclaimer: I'm not familiar with m68k, just covering the usual atomic lowering legality questions.)
Agree with @nikic. I think it would be better for this patch to support CmpXchg using CAS (for 68020 and later) and fall back to a library solution for older CPUs.
Thanks for replying!
To be honest, I can't find any mention of atomic ordering in the document, and I can't find a memory barrier instruction either. I just followed what gcc does.
There is a discussion here: https://github.com/M680x0/M680x0-mono-repo/issues/13
Are multi-processor m68k computers a thing? I can't find any reference to such a thing existing, but the manual indicates that the processor was designed to allow it. If it does exist, m68k probably needs to use sequences similar to x86. (x86 didn't have any barrier instruction for a long time, but a "lock" instruction has the right semantics.)
Take my word with a pinch of salt
We may have to use the compare-and-swap instruction to achieve that (or do it with the aid of glibc). But is that necessary on such an old architecture, which may lack modern techniques such as out-of-order execution and store/write buffers --- the very reasons atomic instructions and memory barriers exist?
I can't remember any details (so long ago, sniff) - but wasn't TAS (test and set) used as an alternative to CAS, which only appeared on 68020 and later? I vaguely remember that on Amigas (which, although not truly multi-processor, had a lot of custom chips accessing some of the memory at the same time) you had to be very careful about using TAS/CAS.
llvm/lib/Target/M68k/M68kTargetMachine.cpp:161
Also - do we need to add a pipeline.ll test file?
Store buffers existed in the era of the m68k, I think. OOO was mostly later, though.
That said, we shouldn't try to theorycraft the semantics of a system that doesn't actually exist. I'm okay with assuming we have a single processor if the rest of the ecosystem makes the same assumption (which means we only have to care about interrupts, not memory reordering).
I dug into libatomic.a; here is part of the result:

  00000000 <__atomic_store_4>:
     0:	206f 0004	moveal %sp@(4),%a0
     4:	20af 0008	movel %sp@(8),%a0@
     8:	4e75		rts
gcc also compiles to the same result:
  $ echo 'void foo(int *x, int y) { __atomic_store_4(x, y, __ATOMIC_SEQ_CST); }' | m68k-linux-gnu-gcc -m68040 -x c - -o - -S -O2
  #NO_APP
  	.file	"<stdin>"
  	.text
  	.align	2
  	.globl	foo
  	.type	foo, @function
  foo:
  	move.l 4(%sp),%a0
  	move.l 8(%sp),(%a0)
  	rts
  	.size	foo, .-foo
  	.ident	"GCC: (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0"
  	.section	.note.GNU-stack,"",@progbits
Maybe a normal load / store is sufficient to match the atomic ordering semantics on m68k?
llvm/lib/Target/M68k/M68kTargetMachine.cpp:161
What is pipeline.ll?
llvm/lib/Target/M68k/M68kTargetMachine.cpp:161
https://github.com/llvm/llvm-project/blob/main/llvm/test/CodeGen/AMDGPU/llc-pipeline.ll
Different backends do it differently, but basically it's a test file that checks which passes have been run at all/some opt levels.
llvm/lib/Target/M68k/M68kTargetMachine.cpp:161
Thanks. This looks interesting. As far as I'm concerned, there is no reason not to add this one for M68k.
address feedback
llvm/lib/Target/M68k/M68kInstrAtomics.td:11
Is there any example I can refer to?
I think it makes sense to assume m68k always runs on uniprocessors. GCC / libgcc and the m68k port of Linux make the same assumption. For instance, gcc simply lowers an atomic fence into asm volatile("" ::: "memory") (meaning the only thing we need to do is prevent the compiler from reordering across the fence).
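For illustration (not code from this patch), such a fence amounts to an empty asm statement with a memory clobber: it emits no machine instruction and only forbids the compiler from moving memory accesses across it.

  // A seq_cst fence on a uniprocessor m68k: no instruction is emitted;
  // the statement is purely a compiler-reordering barrier.
  void fence_seq_cst(void) {
    asm volatile("" ::: "memory");
  }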
Though I don't think the changes to getMaxAtomicSizeInBitsSupported can really be justified, given that you can lower unsupported atomic operations to library calls in the legalizer as well.
llvm/lib/CodeGen/AtomicExpandPass.cpp:176 (On Diff #471716)
I think you tried to turn every atomic operation except atomic_load / store / cmpxchg into a libcall here. But even if we don't turn them into libcalls in this pass, we can still do that during legalization by marking the corresponding SDNode as Expand or LibCall, right?
llvm/lib/Target/M68k/M68kInstrAtomics.td:28
Why do we need to wrap these patterns with multiclass?

llvm/test/MC/Disassembler/M68k/atomics.txt:4
what about other sizes?

llvm/test/MC/M68k/Atomics/cas.s:6
ditto other sizes
llvm/lib/CodeGen/AtomicExpandPass.cpp:176 (On Diff #471716)
*corresponding operation as Expand or LibCall
llvm/lib/Target/M68k/M68kInstrAtomics.td:28
Probably the best way to make this 020+ only is to wrap the patterns, something like:

  let Predicates = [FeatureISA20] {
    foreach size = [8, 16, 32] in {
      ....
    }
  }
llvm/lib/CodeGen/AtomicExpandPass.cpp:176 (On Diff #471716)
Marking it as LibCall can stop the type legalizer from expanding an AtomicStore with a 64-bit operand to an AtomicSwap, which would be transformed into __sync_lock_test_and_set --- a function that I can't find in libatomic.a. We may need to find a way to stop the type legalizer from doing that, so the legalizer can expand the AtomicStore to a library call. Any ideas?
llvm/lib/Target/M68k/M68kInstrAtomics.td:28
Thanks. I'll look into it.
llvm/lib/CodeGen/AtomicExpandPass.cpp:176 (On Diff #471716)
Here is where the undesirable expansion happens:
llvm/lib/CodeGen/AtomicExpandPass.cpp:176 (On Diff #471716)
*Marking it as LibCall CANNOT*
llvm/lib/CodeGen/AtomicExpandPass.cpp:176 (On Diff #471716)
We may want to mark ATOMIC_CMP_SWAP as LibCall when the target is below 020, but that also turns into a __sync_* library call. Here is my workaround: you can see from https://github.com/llvm/llvm-project/blob/main/llvm/lib/CodeGen/AtomicExpandPass.cpp#L178 that AtomicExpandPass checks the alignment to decide whether it should give it a go. We could add an IR pass that changes the alignment of the cmpxchg to zero so that AtomicExpandPass will do the work for us. Do you think this is a promising approach?
llvm/lib/CodeGen/AtomicExpandPass.cpp:176 (On Diff #471716)
Another solution is to mark ATOMIC_CMP_SWAP as Custom and lower it to __atomic_compare_swap_* ourselves. Which one do you prefer?
llvm/lib/CodeGen/AtomicExpandPass.cpp:176 (On Diff #471716)
Pros of the first solution: it is easier to implement. Cons: it's not that orthodox.
Pros of the second: it follows the tradition. Cons: we have to consider not only CAS but also the RMW instructions, which may require some effort.
llvm/lib/Target/M68k/M68kTargetMachine.cpp:161
So do we need to add TargetPassConfig::addIRPasses() to this call now?
The atomic width is never a property of a specific instruction. Either *all* atomic ops of a given width are lock-free, or *all* atomic ops of a given width need to be transformed into __atomic_* libcalls.
So the right thing to do is something like the following, in M68kTargetLowering::M68kTargetLowering():

  if (Subtarget.atLeastM68020())
    setMaxAtomicSizeInBitsSupported(32);
  else
    setMaxAtomicSizeInBitsSupported(0);
If you need to emulate rmw ops on top of cmpxchg, implementing shouldExpandAtomicRMWInIR will tell AtomicExpandPass to handle that expansion for you.
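For reference, a minimal sketch of that hook (assuming the atLeastM68020() subtarget query shown above; not the exact code from this patch):

  // Expand every atomicrmw into an inline compare-and-swap loop when the
  // subtarget has native CAS; otherwise leave it alone so it can become a
  // libcall later.
  TargetLowering::AtomicExpansionKind
  M68kTargetLowering::shouldExpandAtomicRMWInIR(AtomicRMWInst *RMW) const {
    if (Subtarget.atLeastM68020())
      return AtomicExpansionKind::CmpXChg;
    return AtomicExpansionKind::None;
  }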
It's possible to emulate a lock-free atomic cmpxchg on a uniprocessor (see, for example, https://www.kernel.org/doc/Documentation/arm/kernel_user_helpers.txt). But the compiler can't do it by itself; it needs some sort of operating system assistance. And as far as I know, Linux requires a 68020 anyway, so I don't see a reason for you to go down that path.
I have a few questions:
- So I cannot lower atomic_load / atomic_store to native instructions while lowering atomic_compare_and_swap to a library call --- all of them should be lowered either to native instructions or to library calls?
- Why is the atomic width not a property of a specific instruction?
Thanks!
llvm/lib/Target/M68k/M68kTargetMachine.cpp:161
Oh, I forgot it. I will address this later. Thanks!
One more question:
- Is it legal to transform an atomic_load / atomic_store into a normal load / store before AtomicExpandPass? Does that break any property of the atomic instruction?
Let me try to answer most of the questions at once. But first, here is the workflow I would use:
- For 68020 and later, call setMaxAtomicSizeInBitsSupported(32), and setMaxAtomicSizeInBitsSupported(0) otherwise. AtomicExpandPass then replaces everything above the supported size with __atomic_*.
- AtomicLoad, AtomicStore and AtomicCmpXchg on targets >= 68020 will be lowered to native instructions.
- Mark every other ISD::ATOMIC_* as LibCall; this effectively lowers them into library calls to __sync_* (a rough sketch of the whole setup follows this list).
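A rough sketch of that setup in the M68kTargetLowering constructor (names as used in this thread; illustrative, not the actual diff):

  if (Subtarget.atLeastM68020())
    setMaxAtomicSizeInBitsSupported(32); // lock-free up to 32 bits
  else
    setMaxAtomicSizeInBitsSupported(0);  // everything becomes __atomic_*

  // Every atomic op other than load / store / cmpxchg becomes a __sync_*
  // call via the legalizer.
  for (auto VT : {MVT::i8, MVT::i16, MVT::i32}) {
    setOperationAction(ISD::ATOMIC_SWAP, VT, LibCall);
    setOperationAction(ISD::ATOMIC_LOAD_ADD, VT, LibCall);
    setOperationAction(ISD::ATOMIC_LOAD_SUB, VT, LibCall);
    // ... likewise for the remaining ISD::ATOMIC_LOAD_* opcodes.
  }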
Now here are the questions:
> Marking it as LibCall can stop the type legalizer from expanding an AtomicStore with a 64-bit operand to an AtomicSwap,

I'm not sure how you arrived at this, given that a 64-bit AtomicStore should already have been replaced with __atomic_store by AtomicExpandPass. So you don't need to mark ISD::ATOMIC_STORE as LibCall.
> transformed into __sync_lock_test_and_set --- a function that I can't find in libatomic.a.

Correct, because __sync_* come from libgcc, the primary compiler runtime library for the GNU toolchain. In the case of m68k, these __sync_* functions are "basically" lock-free -- they simply make a syscall that provides the cmpxchg feature (though that also means they're less portable than, say, libatomic, which is a pure C implementation). To the best of my understanding, the reason the legalizer chooses to lower atomic operations into __sync_* rather than __atomic_* is that the former are lower level and, more importantly, libgcc is always available during linking (when using the GNU driver, of course, which is the case for m68k-linux-gnu).
> So I cannot lower atomic_load / atomic_store to native instructions while lowering atomic_compare_and_swap to a library call --- all of them should be lowered either to native instructions or to library calls?

> Why is the atomic width not a property of a specific instruction?

To the best of my understanding, LLVM makes the decision to segregate native atomic support by size, rather than by specific instruction, simply because most architectures work that way: you either have full or no atomic support for a certain size. This means m68k, as an anomaly, needs to do some extra work on this matter, which is not difficult, because I believe the workflow at the beginning of this comment can lower atomic load / store to native instructions (on supported sub-architectures) while lowering the other atomic operations to libcalls for a given size.
The __atomic_* libcalls are allowed to be implemented using a lock. If you try to mix in lock-free load and store operations, they won't respect that lock.
If m68k has __sync_* in libgcc, those are lock-free; you can use them alongside plain load/store instructions.
https://llvm.org/docs/Atomics.html has more details on various forms of calls.
Thanks for all of your edifying comments. The path is getting clearer. Here is my understanding; correct me if I'm wrong:
1. __atomic_* is allowed to use a lock, whilst __sync_* is lock-free.
2. Non-lock-free functions must NOT be used simultaneously with lock-free instructions or functions; otherwise the lock may not be respected (i.e. a race condition may occur).
3. atomic_compare_and_swap needs to be transformed to __atomic_* if we don't have CAS support.
Conclusion 1: According to 1 & 2 & 3, if we don't have CAS support, we have to transform atomic_load / atomic_store to __atomic_* library calls, instead of normal loads / stores, to avoid mixing lock-free instructions with non-lock-free library calls.
Conclusion 2: According to 1 & 2 & 3, if we do have CAS support, we can simply lower atomic_load / atomic_store / atomic_compare_and_swap to native instructions, and expand atomic_rmw to either atomic_compare_and_swap or __sync_* calls --- all of which are lock-free.
Right?
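(A hypothetical illustration of the hazard behind conclusion 1; the lock and function names below are invented, and nothing here is from the patch:)

  #include <cstdint>
  #include <mutex>

  // Suppose the __atomic_* libcalls are implemented with a global lock.
  static std::mutex AtomicLock;

  uint32_t lockBasedCmpXchg(uint32_t *Ptr, uint32_t Expected,
                            uint32_t Desired) {
    std::lock_guard<std::mutex> Guard(AtomicLock);
    uint32_t Old = *Ptr;
    if (Old == Expected)
      *Ptr = Desired;
    return Old;
  }

  void lockFreeStore(uint32_t *Ptr, uint32_t Val) {
    // A plain store never takes AtomicLock, so it can land between the
    // load and the store inside lockBasedCmpXchg and be silently lost:
    // exactly the race created by mixing lock-based and lock-free accesses.
    *Ptr = Val;
  }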
I think we can also transform atomic_compare_and_swap to its __sync counterpart, __sync_val_compare_and_swap, if we don't have CAS (i.e. < M68020). And I think __sync is preferable here, because it makes more sense to conditionally lower cmpxchg to a libcall based on the target CPU in the backend than to detect the target CPU in an IR pass.
llvm/lib/Target/M68k/M68kInstrAtomics.td:11
You can use let Predicates = [IsM68020] in { ... } to wrap both the instruction definitions and the patterns (the predicates won't be honored if you only wrap the instruction definitions, since we're using custom patterns). Though I just realized that the IsM680X0 predicates defined in M68kInstrInfo.td are wrong. I will fix that shortly.
Right. You want to call setMaxAtomicSizeInBitsSupported(32) on targets that have 32-bit CAS.
If __sync_* is available for a given bitwidth, it counts as "having" CAS support. setOperationAction(ISD::ATOMIC_CMP_SWAP, MVT::i32, LibCall); will tell the legalizer to generate the call for you.
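Put together, a sketch of that rule (in the M68kTargetLowering constructor, with the atLeastM68020() query from earlier; illustrative, not the committed diff):

  // 32-bit atomics count as lock-free on every subtarget, because either
  // native CAS (68020+) or the lock-free __sync_* libcalls are available.
  setMaxAtomicSizeInBitsSupported(32);
  if (!Subtarget.atLeastM68020()) {
    // No native CAS: let the legalizer emit __sync_val_compare_and_swap_4.
    setOperationAction(ISD::ATOMIC_CMP_SWAP, MVT::i32, LibCall);
  }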
We don't really make the distinction between "IR" passes vs. "CodeGen" passes. We draw the line between "optimization" passes (which are controlled by the frontend) and "backend" passes (which are directly controlled by the target in its implementation of TargetPassConfig). Some "backend" passes run on IR simply because it's convenient: LLVM IR is simpler to deal with (especially given the limitations of SelectionDAG).
update diff:
- We transform all atomic instructions to __atomic_* for sizes > 32.
- Otherwise, lower to either native instructions or __sync_* function calls.
The above two rules apply to all subtargets.
I think the patch is in good shape now. I only have some minor comments. Thanks for the work :-)
llvm/lib/Target/M68k/M68kISelLowering.cpp:169
Just want to double check: were these lines formatted by clang-format?

llvm/lib/Target/M68k/M68kInstrAtomics.td:11
Yes, thank you!

llvm/lib/Target/M68k/M68kInstrAtomics.td:42
nit: align with !cast in the previous line. Ditto for the following line.

llvm/lib/Target/M68k/M68kInstrAtomics.td:45
nit: "// let Predicates = [AtLeastM68020]"
llvm/lib/Target/M68k/M68kISelLowering.cpp:167
It probably also makes sense to make shouldExpandAtomicRMWInIR return AtomicExpansionKind::CmpXChg on M68020, to use inline loops for atomicrmw.
llvm/lib/Target/M68k/M68kISelLowering.cpp:169
Yes.
update diff:
- Expand atomic-rmw to atomic-compare-and-swap on targets >= M68020
- address feedback