This is an archive of the discontinued LLVM Phabricator instance.

[M68k] Add codegen pattern for atomic load / store
ClosedPublic

Authored by 0x59616e on Oct 22 2022, 5:56 AM.

Diff Detail

Event Timeline

0x59616e created this revision.Oct 22 2022, 5:56 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 22 2022, 5:56 AM
Herald added a subscriber: hiraditya. · View Herald Transcript
0x59616e requested review of this revision.Oct 22 2022, 5:56 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 22 2022, 5:56 AM
0x59616e updated this revision to Diff 469952.Oct 22 2022, 7:42 PM
0x59616e retitled this revision from [WIP][M68k] Add implementation for atomic load / store to [M68k] Add codegen pattern for atomic load / store.
0x59616e edited the summary of this revision. (Show Details)
0x59616e added reviewers: myhsu, RKSimon, ricky26, glaubitz.
nikic added a subscriber: nikic.Oct 23 2022, 12:38 AM

Preferably this should also include the implementation for atomic RMW/CAS instructions to prove that this lowering is legal. If native or at least kernel-supported CAS is not available, then atomic load/store needs to use libatomic (possibly subtarget dependent).

It would also be good to include relevant quotes from the ISA manual -- atomicity of load/store is usually a given, but do they also guarantee a seq_cst ordering without a memory barrier?

(Disclaimer: I'm not familiar with m68k, just covering the usual atomic lowering legality questions.)

myhsu added a comment.EditedOct 23 2022, 10:51 AM

Agree with @nikic. I think it would be better for this patch to support CmpXchg using CAS (for 68020 and later) and fall back to a library solution for older CPUs.

Preferably this should also include the implementation for atomic RMW/CAS instructions to prove that this lowering is legal. If native or at least kernel-supported CAS is not available, then atomic load/store needs to use libatomic (possibly subtarget dependent).

It would also be good to include relevant quotes from the ISA manual -- atomicity of load/store is usually a given, but do they also guarantee a seq_cst ordering without a memory barrier?

(Disclaimer: I'm not familiar with m68k, just covering the usual atomic lowering legality questions.)

Thanks for replying!

To be honest, I can't find any word regarding atomic ordering in the documentation. I can't find any memory barrier instruction either. I just followed what gcc does.

There is a discussion here: https://github.com/M680x0/M680x0-mono-repo/issues/13

0x59616e planned changes to this revision.Oct 23 2022, 6:07 PM

Working on the CAS/RMW instructions.

Are multi-processor m68k computers a thing? I can't find any reference to such a thing existing, but the manual indicates that the processor was designed to allow it. If it does exist, m68k probably needs to use sequences similar to x86. (x86 didn't have any barrier instruction for a long time, but a "lock" instruction has the right semantics.)

Are multi-processor m68k computers a thing? I can't find any reference to such a thing existing, but the manual indicates that the processor was designed to allow it. If it does exist, m68k probably needs to use sequences similar to x86. (x86 didn't have any barrier instruction for a long time, but a "lock" instruction has the right semantics.)

Take my word with a pinch of salt

We may have to use the compare-and-swap instruction to achieve that (or rely on glibc). But is it necessary on such an old architecture, which may not have modern features such as out-of-order execution or store/write buffers --- the reasons atomic instructions and memory barriers exist in the first place?

Are multi-processor m68k computers a thing? I can't find any reference to such a thing existing, but the manual indicates that the processor was designed to allow it. If it does exist, m68k probably needs to use sequences similar to x86. (x86 didn't have any barrier instruction for a long time, but a "lock" instruction has the right semantics.)

Take my word with a pinch of salt

We may have to use the compare-and-swap instruction to achieve that (or rely on glibc). But is it necessary on such an old architecture, which may not have modern features such as out-of-order execution or store/write buffers --- the reasons atomic instructions and memory barriers exist in the first place?

I can't remember any details (so long ago, sniff) - but wasn't TAS (test-and-set) used as an alternative to CAS, which only appeared on 68020 and later? I vaguely remember that on Amigas, which although not truly multi-processor had a lot of custom chips accessing some of the memory at the same time, you had to be very careful about using TAS/CAS.

RKSimon added inline comments.Oct 25 2022, 5:39 AM
llvm/lib/Target/M68k/M68kTargetMachine.cpp
161

Also - do we need to add a pipeline.ll test file?

Are multi-processor m68k computers a thing? I can't find any reference to such a thing existing, but the manual indicates that the processor was designed to allow it. If it does exist, m68k probably needs to use sequences similar to x86. (x86 didn't have any barrier instruction for a long time, but a "lock" instruction has the right semantics.)

Take my word with a pinch of salt

We may have to use the compare-and-swap instruction to achieve that (or rely on glibc). But is it necessary on such an old architecture, which may not have modern features such as out-of-order execution or store/write buffers --- the reasons atomic instructions and memory barriers exist in the first place?

Store buffers existed in the era of the m68k, I think. OOO was mostly later, though.

That said, we shouldn't try to theorycraft the semantics of a system that doesn't actually exist. I'm okay with assuming we have a single processor if the rest of the ecosystem makes the same assumption (which means we only have to care about interrupts, not memory reordering).

I dug into libatomic.a; here is part of the result:

00000000 <__atomic_store_4>:
   0:   206f 0004       moveal %sp@(4),%a0
   4:   20af 0008       movel %sp@(8),%a0@
   8:   4e75            rts

gcc also produces the same result:

$ echo 'void foo(int *x, int y) { __atomic_store_4(x, y, __ATOMIC_SEQ_CST); }' | m68k-linux-gnu-gcc -m68040 -x c - -o - -S -O2
#NO_APP
	.file	"<stdin>"
	.text
	.align	2
	.globl	foo
	.type	foo, @function
foo:
	move.l 4(%sp),%a0
	move.l 8(%sp),(%a0)
	rts
	.size	foo, .-foo
	.ident	"GCC: (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0"
	.section	.note.GNU-stack,"",@progbits

Maybe a normal load / store is sufficient to satisfy the atomic ordering semantics on m68k?

llvm/lib/Target/M68k/M68kTargetMachine.cpp
161

What is pipeline.ll?

RKSimon added inline comments.Oct 27 2022, 4:29 AM
llvm/lib/Target/M68k/M68kTargetMachine.cpp
161
0x59616e marked an inline comment as done.Oct 27 2022, 4:34 AM
0x59616e added inline comments.
llvm/lib/Target/M68k/M68kTargetMachine.cpp
161

Thanks. This looks interesting. As far as I'm concerned, there is no reason not to add this for M68k.

0x59616e updated this revision to Diff 471715.Oct 28 2022, 10:31 PM
0x59616e marked an inline comment as done.

Add support for atomicrmw and cmpxchg

0x59616e updated this revision to Diff 471716.Oct 28 2022, 10:35 PM
0x59616e marked an inline comment as done.

Add pipeline.ll

Please can you pre-commit pipeline.ll for current trunk and then rebase so that the patch shows the diff?

llvm/lib/Target/M68k/M68kInstrAtomics.td
11

Wrap this inside FeatureISA20?

llvm/test/CodeGen/M68k/pipeline.ll
2

; CHECK

105

newline

0x59616e updated this revision to Diff 471889.Oct 30 2022, 6:13 PM
0x59616e marked an inline comment as done.

Address feedback

llvm/lib/Target/M68k/M68kInstrAtomics.td
11

Is there any example I can refer to?

I think it makes sense to assume m68k always runs on uniprocessors. GCC / libgcc and the m68k port of Linux also make the same assumption. For instance, gcc simply lowers an atomic fence into asm volatile("" ::: "memory") (meaning the only thing we need to do is prevent the compiler from reordering across the fence).
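
For illustration only (this snippet is not part of the patch; it just shows the idiom as a plain C/C++ helper), that barrier compiles to no machine code at all:

// A pure compiler barrier: emits no instruction, it only stops the compiler
// from moving memory accesses across this point, which is sufficient on a
// uniprocessor where only interrupts can interleave with the current thread.
static inline void compiler_barrier(void) {
  asm volatile("" ::: "memory");
}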

Though I don't think the changes to getMaxAtomicSizeInBitsSupported can really be justified, given that you can lower unsupported atomic operations to library calls in the legalizer as well.

llvm/lib/CodeGen/AtomicExpandPass.cpp
176 ↗(On Diff #471716)

I think you tried to turn every atomic operation except atomic_load / store / cmpxchg into a libcall here. But even if we don't turn them into libcalls in this pass, we can still do that during legalization by marking the corresponding SDNode as Expand or LibCall, right?

llvm/lib/Target/M68k/M68kInstrAtomics.td
28

Why do we need to wrap these patterns with multiclass?

llvm/test/MC/Disassembler/M68k/atomics.txt
4

what about other sizes?

llvm/test/MC/M68k/Atomics/cas.s
6

ditto other sizes

myhsu added inline comments.Oct 30 2022, 10:07 PM
llvm/lib/CodeGen/AtomicExpandPass.cpp
176 ↗(On Diff #471716)

*corresponding operation as Expand or LibCall

0x59616e updated this revision to Diff 471948.Oct 31 2022, 3:22 AM

Address feedback

llvm/lib/CodeGen/AtomicExpandPass.cpp
176 ↗(On Diff #471716)

I'll look into it.

llvm/lib/Target/M68k/M68kInstrAtomics.td
28

Yeah that's verbose. I didn't think this through.

0x59616e marked an inline comment as done.Oct 31 2022, 3:22 AM
0x59616e marked 3 inline comments as done.
RKSimon added inline comments.Oct 31 2022, 3:40 AM
llvm/lib/Target/M68k/M68kInstrAtomics.td
28

Probably the best way to make this 020+ only is to wrap the patterns, something like:

let Predicates = [FeatureISA20] {
 foreach size = [8, 16, 32] in {
  ....
 }
}
0x59616e added inline comments.Nov 1 2022, 5:37 AM
llvm/lib/CodeGen/AtomicExpandPass.cpp
176 ↗(On Diff #471716)

Marking it as LibCall can stop the type legalizer from expanding an AtomicStore with a 64-bit operand to AtomicSwap, which will be transformed to __sync_lock_test_and_set --- a function that I can't find in libatomic.a.

We may need to find a way to stop the type legalizer from doing this so that the legalizer can expand the AtomicStore to a library call. Any ideas?

llvm/lib/Target/M68k/M68kInstrAtomics.td
28

Thanks. I'll look into it.

0x59616e added inline comments.Nov 1 2022, 5:44 AM
llvm/lib/CodeGen/AtomicExpandPass.cpp
176 ↗(On Diff #471716)
0x59616e added inline comments.Nov 1 2022, 5:46 AM
llvm/lib/CodeGen/AtomicExpandPass.cpp
176 ↗(On Diff #471716)

*Marking it as LibCall CANNOT*

0x59616e added inline comments.Nov 1 2022, 8:33 AM
llvm/lib/CodeGen/AtomicExpandPass.cpp
176 ↗(On Diff #471716)

We may want to mark ATOMIC_CMP_SWAP as LibCall when the target is below 020. But that also turns into a __sync_* library call.

Here is my workaround: You can see from here:

https://github.com/llvm/llvm-project/blob/main/llvm/lib/CodeGen/AtomicExpandPass.cpp#L178

that AtomicExpandPass checks the alignment to decide whether it should expand the operation. We could add an IR pass that changes the alignment of cmpxchg to zero so that AtomicExpandPass will do the work for us.

Do you think this is a feasible approach?

0x59616e added inline comments.Nov 1 2022, 8:52 AM
llvm/lib/CodeGen/AtomicExpandPass.cpp
176 ↗(On Diff #471716)

Another solution is to mark ATOMIC_CMP_SWAP as Custom and lower it to __atomic_compare_swap_* ourselves.

Which one do you prefer?

0x59616e added inline comments.Nov 1 2022, 9:11 AM
llvm/lib/CodeGen/AtomicExpandPass.cpp
176 ↗(On Diff #471716)

Pros of the first solution: it is easier to implement.

Cons of the first solution: it's not very orthodox.

Pros of the second: it follows convention.

Cons of the second: we have to consider not only CAS but also the RMW instructions. It may require more effort.

RKSimon added inline comments.Nov 1 2022, 9:29 AM
llvm/lib/Target/M68k/M68kTargetMachine.cpp
161

So do we need to add TargetPassConfig::addIRPasses() to this call now?

The atomic width is never a property of a specific instruction. Either *all* atomic ops of a given width are lock-free, or *all* atomic ops of a given width need to be transformed into __atomic_* libcalls.

So the right thing to do is something like the following, in M68kTargetLowering::M68kTargetLowering():

if (Subtarget.atLeastM68020())
  setMaxAtomicSizeInBitsSupported(32);
else
  setMaxAtomicSizeInBitsSupported(0);

If you need to emulate rmw ops on top of cmpxchg, implementing shouldExpandAtomicRMWInIR will tell AtomicExpandPass to handle that expansion for you.

It's possible to emulate a lock-free atomic cmpxchg on a uniprocessor, for example, https://www.kernel.org/doc/Documentation/arm/kernel_user_helpers.txt . But the compiler can't do it by itself; it needs some sort of operating system assistance. And as far as I know, Linux requires a 68020 anyway, so I don't see a reason for you to go down that path.

The atomic width is never a property of a specific instruction. Either *all* atomic ops of a given width are lock-free, or *all* atomic ops of a given width need to be transformed into __atomic_* libcalls.

I have a few questions:

  • So I cannot lower atomic_load / atomic_store to native instructions whilst lowering atomic_compare_and_swap to a library call --- all of them should be lowered either to native instructions or to library calls?
  • Why is the atomic width not a property of a specific instruction?

Thanks !

llvm/lib/Target/M68k/M68kTargetMachine.cpp
161

Oh, I forgot. I will address this later. Thanks!

0x59616e added a comment.EditedNov 1 2022, 6:56 PM

One more question:

  • Is it legal to transform atomic_load / atomic_store into a normal load / store before AtomicExpandPass? Does that break any property of the atomic instructions?
myhsu added a comment.Nov 1 2022, 9:58 PM

Let me try to answer most of the questions at once. But first, here is the workflow I would use:

  1. For 68020 and later, call setMaxAtomicSizeInBitsSupported(32); otherwise call setMaxAtomicSizeInBitsSupported(0). AtomicExpandPass then replaces everything outside the supported size range with __atomic_* calls.
  2. AtomicLoad, AtomicStore and AtomicCmpXchg on targets >= 68020 will be lowered to native instructions.
  3. Mark every other ISD::ATOMIC_* as LibCall; this effectively lowers them into library calls to __sync_* (see the sketch after this list).
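
A rough sketch of how steps 1 and 3 could look in M68kTargetLowering's constructor (illustrative only; the exact opcode list and the subtarget check are assumptions rather than the final patch):

// Step 1: sizes above the supported maximum are turned into __atomic_* calls
// by AtomicExpandPass before instruction selection.
if (Subtarget.atLeastM68020())
  setMaxAtomicSizeInBitsSupported(32);
else
  setMaxAtomicSizeInBitsSupported(0);

// Step 3: atomic ops without native patterns become __sync_* libcalls
// during legalization.
for (MVT VT : {MVT::i8, MVT::i16, MVT::i32}) {
  setOperationAction(ISD::ATOMIC_SWAP,     VT, LibCall);
  setOperationAction(ISD::ATOMIC_LOAD_ADD, VT, LibCall);
  setOperationAction(ISD::ATOMIC_LOAD_SUB, VT, LibCall);
  // ...and the remaining ISD::ATOMIC_LOAD_* opcodes as needed...
}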

Now here are the questions:

Marking it as LibCall can stop the type legalizer from expanding an AtomicStore with a 64-bit operand to AtomicSwap,

I'm not sure how you got this, given that a 64-bit AtomicStore should already have been replaced with __atomic_store by AtomicExpandPass. So you don't need to mark ISD::ATOMIC_STORE as LibCall.

transformed to __sync_lock_test_and_set --- a function that I can't find in libatomic.a.

Correct, because __sync_* comes from libgcc, the primary compiler runtime library for the GNU toolchain. In the case of m68k, these __sync_* functions are "basically" lock-free -- they simply make a syscall that provides the cmpxchg feature (though that also means they are less portable than, say, libatomic, which is a pure C implementation). To my best understanding, the reason the legalizer chooses to lower atomic operations into __sync_* rather than __atomic_* is that the former are lower level and, more importantly, libgcc is always available during linking (when using the GNU driver, of course, which is the case for m68k-linux-gnu).

So I cannot lower atomic_load / atomic_store to native instructions whilst lowering atomic_compare_and_swap to a library call --- all of them should be lowered either to native instructions or to library calls?
Why is the atomic width not a property of a specific instruction?

To my best understanding, LLVM decided to segregate native atomic support by size, rather than by specific instruction, simply because most architectures work that way. You either have full or no atomic support for a certain size. Meaning m68k, as an anomaly, needs to do some extra work on this matter, which is not difficult, because I believe the workflow I put at the beginning of this comment can lower atomic load / store to native instructions (on supported sub-architectures) while lowering other atomic operations to libcalls for a given size.

So I cannot lower atomic_load / atomic_store to native instructions whilst lowering atomic_compare_and_swap to a library call --- all of them should be lowered either to native instructions or to library calls?
Why is the atomic width not a property of a specific instruction?

To my best understanding, LLVM decided to segregate native atomic support by size, rather than by specific instruction, simply because most architectures work that way. You either have full or no atomic support for a certain size. Meaning m68k, as an anomaly, needs to do some extra work on this matter, which is not difficult, because I believe the workflow I put at the beginning of this comment can lower atomic load / store to native instructions (on supported sub-architectures) while lowering other atomic operations to libcalls for a given size.

The __atomic_* libcalls are allowed to be implemented using a lock. If you try to mix in lock-free load and store operations, they won't respect that lock.

If m68k has __sync_* in libgcc, those are lock-free; you can use them alongside plain load/store instructions.

https://llvm.org/docs/Atomics.html has more details on various forms of calls.

0x59616e added a comment.EditedNov 2 2022, 2:21 AM

Thanks for all of your edifying comments. The path is getting clearer. Here is my understanding. Correct me if I'm wrong.

  1. __atomic_* is allowed to use a lock, whilst __sync_* is lock-free.
  2. A non-lock-free function must NOT be used simultaneously with a lock-free instruction or function; otherwise the lock may not be respected (i.e. a race condition may occur).
  3. atomic_compare_and_swap needs to be transformed to __atomic_* if we don't have CAS support.

Conclusion 1: according to 1, 2 & 3, if we don't have CAS support, we have to transform atomic_load / atomic_store into __atomic_* library calls, instead of normal loads / stores, to avoid mixing lock-free instructions with non-lock-free library calls.

Conclusion 2: according to 1, 2 & 3, if we have CAS support, we can simply lower atomic_load / atomic_store / atomic_compare_and_swap to native instructions, and expand atomic_rmw to either atomic_compare_and_swap or __sync_* calls --- all of which are lock-free.

Right?

myhsu added a comment.Nov 3 2022, 3:40 PM

Thanks for all of your edifying comments. The path is getting clearer. Here is my understanding. Correct me if I'm wrong.

  1. __atomic_* is allowed to use a lock, whilst __sync_* is lock-free.
  2. A non-lock-free function must NOT be used simultaneously with a lock-free instruction or function; otherwise the lock may not be respected (i.e. a race condition may occur).
  3. atomic_compare_and_swap needs to be transformed to __atomic_* if we don't have CAS support.

I think we can also transform atomic_compare_and_swap to its __sync counterpart, __sync_val_compare_and_swap, if we don't have CAS (i.e. < M68020). And I think __sync is preferable here because it makes more sense to conditionally lower cmpxchg to a libcall based on the target CPU in the backend, compared to detecting the target CPU in an IR pass.

myhsu added inline comments.Nov 3 2022, 3:46 PM
llvm/lib/Target/M68k/M68kInstrAtomics.td
11

You can use let Predicates = [IsM68020] in { ... } to wrap both the instruction definitions and the patterns (the predicates won't be honored if you only wrap the instruction definitions, since we're using custom patterns). Though I just realized that the IsM680X0 predicates defined in M68kInstrInfo.td are wrong. I will fix that shortly.

Conclusion 1: according to 1, 2 & 3, if we don't have CAS support, we have to transform atomic_load / atomic_store into __atomic_* library calls, instead of normal loads / stores, to avoid mixing lock-free instructions with non-lock-free library calls.

Conclusion 2: according to 1, 2 & 3, if we have CAS support, we can simply lower atomic_load / atomic_store / atomic_compare_and_swap to native instructions, and expand atomic_rmw to either atomic_compare_and_swap or __sync_* calls --- all of which are lock-free.

Right?

Right. You want to call setMaxAtomicSizeInBitsSupported(32) on targets that have 32-bit CAS.

If __sync_* is available for a given bitwidth, it counts as "having" CAS support. setOperationAction(ISD::ATOMIC_CMP_SWAP, MVT::i32, LibCall); will tell the legalizer to generate the call for you.

And I think __sync is preferable here because it makes more sense to conditionally lower cmpxchg to a libcall based on the target CPU in the backend, compared to detecting the target CPU in an IR pass.

We don't really make the distinction between "IR" passes vs. "CodeGen" passes. We draw the line between "optimization" passes (which are controlled by the frontend) and "backend" passes (which are directly controlled by the target in its implementation of TargetPassConfig). Some "backend" passes run on IR simply because it's convenient: LLVM IR is simpler to deal with (especially given the limitations of SelectionDAG).

0x59616e updated this revision to Diff 473237.EditedNov 4 2022, 7:29 AM

update diff:

  • Transform all atomic instructions to __atomic_* for sizes > 32.
  • Otherwise, lower to either a native instruction or a __sync_* function call.

The above two rules apply to all subtargets.

0x59616e marked 4 inline comments as done.
0x59616e added inline comments.
llvm/lib/Target/M68k/M68kInstrAtomics.td
11

Something like D137425?

0x59616e marked an inline comment as done.Nov 4 2022, 7:34 AM
myhsu added a comment.EditedNov 4 2022, 9:17 AM

I think the patch is in good shape now. I only have some minor comments. Thanks for the work :-)

llvm/lib/Target/M68k/M68kISelLowering.cpp
169

Just want to double check: were these lines formatted by clang-format?

llvm/lib/Target/M68k/M68kInstrAtomics.td
11

Something like D137425?

Yes thank you!

42

nit: align with !cast in the previous line. ditto for the following line.

45

nit: "// let Predicates = [AtLeastM68020]"

efriedma added inline comments.Nov 4 2022, 12:55 PM
llvm/lib/Target/M68k/M68kISelLowering.cpp
167

It probably also makes sense to make shouldExpandAtomicRMWInIR return AtomicExpansionKind::CmpXChg on M68020, to use inline loops for atomicrmw.
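
A minimal sketch of that hook (the enum values come from TargetLowering; the subtarget check is borrowed from earlier comments in this review, so treat it as an assumption rather than the committed code):

// Expand atomicrmw into an inline cmpxchg loop when native CAS is available;
// otherwise return None so the operation can be lowered to a libcall later.
TargetLowering::AtomicExpansionKind
M68kTargetLowering::shouldExpandAtomicRMWInIR(AtomicRMWInst *RMW) const {
  return Subtarget.atLeastM68020() ? AtomicExpansionKind::CmpXChg
                                   : AtomicExpansionKind::None;
}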

0x59616e marked an inline comment as done.Nov 6 2022, 11:19 PM
0x59616e added inline comments.
llvm/lib/Target/M68k/M68kISelLowering.cpp
169

Yes.

0x59616e marked an inline comment as done.Nov 6 2022, 11:19 PM
0x59616e updated this revision to Diff 473897.Nov 8 2022, 12:24 AM
0x59616e marked 4 inline comments as done.

update diff:

  • Expand atomic-rmw to atomic-compare-and-swap on targets >= M68020
  • Address feedback
myhsu accepted this revision.Nov 8 2022, 7:00 AM

LGTM thank you! Please also include cmpxchg and rmw in the title.

This revision is now accepted and ready to land.Nov 8 2022, 7:00 AM
efriedma accepted this revision.Nov 8 2022, 8:55 AM

LGTM

Thanks a lot for all of your kind help ;)

This revision was automatically updated to reflect the committed changes.