This is an archive of the discontinued LLVM Phabricator instance.

Add TargetLowering::shouldExpandAtomicToLibCall and query it from AtomicExpandPass
Abandoned · Public

Authored by asb on May 30 2018, 1:07 PM.

Details

Summary

Currently, AtomicExpandPass will expand atomic instructions to an __atomic_*
libcall if the size is greater than MaxSizeInBitsSupported or the
alignment is less than the value's natural alignment. The only way a target
can influence this is by changing MaxSizeInBitsSupported, but this is a very
blunt instrument. A target might, for instance, want to always generate
libcalls for atomics smaller than the native word size but generate inline
lock-free code for word-sized atomics. If the library implementation is known
to be lock-free, the target has even greater flexibility, such as choosing
whether to emit a libcall depending on whether it is optimising for size.

The new TargetLowering::shouldExpandAtomicToLibCall makes this degree of
flexibility possible. The default implementation makes use of
MaxSizeInBitsSupported so there is no functional change for in-tree or
out-of-tree targets that don't implement their own
shouldExpandAtomicToLibCall.
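
As a rough illustration of the intended shape, here is a standalone sketch; the type names, the AtomicAccessInfo struct, and the 64-bit default below are placeholders rather than the actual patch contents.

```cpp
// Standalone sketch only -- not the real LLVM TargetLowering interface. It
// models the decision the proposed hook would make: the default reproduces
// the existing size/alignment check, and a target override can refine it.
struct AtomicAccessInfo {
  unsigned SizeInBits;          // width of the atomic access
  unsigned AlignInBytes;        // actual alignment of the access
  unsigned NaturalAlignInBytes; // natural alignment for this width
};

class TargetLoweringSketch {
public:
  virtual ~TargetLoweringSketch() = default;

  // Default: expand to an __atomic_* libcall exactly when AtomicExpandPass
  // currently would, i.e. the access is too wide or under-aligned.
  virtual bool shouldExpandAtomicToLibCall(const AtomicAccessInfo &AI) const {
    return AI.SizeInBits > MaxSizeInBitsSupported ||
           AI.AlignInBytes < AI.NaturalAlignInBytes;
  }

protected:
  unsigned MaxSizeInBitsSupported = 64; // placeholder default
};

// A target could then, for example, keep libcalls for sub-word atomics while
// emitting inline lock-free code for word-sized ones.
class ExampleWordSizeLowering : public TargetLoweringSketch {
  bool shouldExpandAtomicToLibCall(const AtomicAccessInfo &AI) const override {
    if (AI.AlignInBytes < AI.NaturalAlignInBytes)
      return true;             // keep the under-alignment fallback
    return AI.SizeInBits < 32; // libcall below the native word size
  }
};
```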

Event Timeline

asb created this revision. May 30 2018, 1:07 PM

Your suggested lowering is illegal according to the API rules for libatomic; for any given size/alignment, either all operations are lock-free, or all operations use locks. Otherwise, atomics aren't threadsafe. See also https://gcc.gnu.org/wiki/Atomic/GCCMM/LIbrary.

ARM Linux gets around this issue by lowering atomic cmpxchg to call __sync_val_compare_and_swap_4 etc. instead; the implementation uses a special lock-free helper provided by the kernel. See https://www.kernel.org/doc/Documentation/arm/kernel_user_helpers.txt .
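
To make that constraint concrete, here is a hypothetical illustration (not libatomic's actual implementation; lockedFetchAdd and AtomicLock are made-up names) of why a locked fallback and an inline lock-free operation cannot be mixed on the same object.

```cpp
// Hypothetical illustration -- not libatomic's actual code. A locked
// emulation and an inline lock-free RMW of the same 16-bit object: the mutex
// only excludes other callers of lockedFetchAdd, so an inline fetch_add can
// interleave with the plain load/store inside the critical section and
// updates can be lost. Hence the rule: per size/alignment, either every
// operation is lock-free or every operation takes the lock.
#include <atomic>
#include <cstdint>
#include <mutex>

static std::mutex AtomicLock; // stand-in for libatomic's internal lock table

uint16_t lockedFetchAdd(uint16_t *P, uint16_t V) {
  std::lock_guard<std::mutex> G(AtomicLock);
  uint16_t Old = *P; // plain load, protected only by AtomicLock
  *P = Old + V;      // plain store, protected only by AtomicLock
  return Old;
}

uint16_t inlineFetchAdd(std::atomic<uint16_t> *P, uint16_t V) {
  return P->fetch_add(V); // lock-free hardware RMW, ignores AtomicLock
}
```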

jfb added a comment. May 30 2018, 1:32 PM

Could you detail what type of code will generate this? In C++ the standard library is responsible for guaranteeing properly sized and aligned atomics. Size will always be accurate, but I'm worried that your change wants perfect alignment information and you might not have that information. Say two pieces of code communicate through the same atomic location, but they have different alignment information, and one gets a libcall and the other doesn't... You're in trouble.

asb added a comment. May 30 2018, 1:33 PM

Thanks Eli, I actually just came to the same conclusion when taking a step back and thinking about it some more. I was caught up on matching RISC-V GCC behaviour in this case, when in fact it seems RISC-V GCC is broken here. The LLVM docs on atomics document this requirement: https://llvm.org/docs/Atomics.html#atomics-and-codegen. The 'blunt instrument' of MaxSizeInBitsSupported is acting as designed.

asb abandoned this revision. May 30 2018, 2:22 PM

Thanks Eli and JF for the rapid feedback. I was too quick to try to replicate GCC behaviour without sanity checking its output.

I've filed a GCC bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86005 and am marking this revision abandoned.

asb added a subscriber: theraven. May 31 2018, 6:52 AM

I'm going to un-abandon this patch. This hook is still helpful in order to e.g. lower to __atomic libcalls for all 8 and 16-bit operations but not 32-bit. This seems acceptable based on the linked GCC and LLVM docs, but isn't possible with the current MaxSizeInBitsSupported. Am I missing any problems with that strategy?

It seems GCC RISC-V will emit __atomic_* libcalls for 8 and 16-bit operations other than load/store even with -march=rv32ia. The immediate response in the linked GCC bug report is that perhaps that's fine if the RISC-V version of these sub-word atomics is guaranteed to be lock-free (which apparently is the case for libatomic compiled for rv32ia). If this is confirmed, it would give a case where intermixing is safe. I don't know how reasonable it is to declare that __atomic_* libcalls are lock-free if compiled with support for the A extension - if you have views, please do jump on the GCC bug report thread.

asb updated this revision to Diff 149284. May 31 2018, 6:54 AM
asb edited the summary of this revision.

Updated patch to document that shouldExpandAtomicToLibCall should return the same result for objects of the same size.

@jfb: I'm simply replicating the existing logic with regards to checking alignment - no functional change is intended.

In D47553#1117630, @asb wrote:

I'm going to un-abandon this patch. This hook is still helpful in order to e.g. lower to __atomic libcalls for all 8 and 16-bit operations but not 32-bit. This seems acceptable based on the linked GCC and LLVM docs, but isn't possible with the current MaxSizeInBitsSupported. Am I missing any problems with that strategy?

It seems GCC RISC-V will emit __atomic_* libcalls for 8 and 16-bit operations other than load/store even with -march=rv32ia. The immediate response in the linked GCC bug report is that perhaps that's fine if the RISC-V version of these sub-word atomics is guaranteed to be lock-free (which apparently is the case for libatomic compiled for rv32ia). If this is confirmed, it would give a case where intermixing is safe. I don't know how reasonable it is to declare that __atomic_* libcalls are lock-free if compiled with support for the A extension - if you have views, please do jump on the GCC bug report thread.

No...while it shouldn't be _broken_, it makes no sense to do so. If you can do 32-bit operations atomically, you can of course do the same for 8 and 16.

On other architectures which have 32-bit atomics, but not 8 and 16-bit atomics, it's generally possible to lower smaller operations using a 32-bit cmpxchg loop. Does that not work on RISCV for some reason?
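
As a sketch of the technique being described (an illustration only, not the AtomicExpandPass expansion; it assumes little-endian byte order and that the containing word may be accessed as a 32-bit atomic), an 8-bit fetch-add can be emulated with a compare-exchange loop on the aligned word containing the byte.

```cpp
// Illustration only; assumes little-endian byte numbering and that casting
// the containing aligned word to std::atomic<uint32_t> is valid here.
#include <atomic>
#include <cstdint>

uint8_t fetchAdd8ViaWordCmpxchg(uint8_t *Addr, uint8_t Val) {
  uintptr_t A = reinterpret_cast<uintptr_t>(Addr);
  auto *Word = reinterpret_cast<std::atomic<uint32_t> *>(A & ~uintptr_t(3));
  unsigned Shift = unsigned(A & 3u) * 8; // byte position within the word
  uint32_t Mask = uint32_t(0xFF) << Shift;

  uint32_t Old = Word->load(std::memory_order_relaxed);
  uint32_t New;
  do {
    // Add within the selected byte, then splice it back into the word.
    uint8_t NewByte = uint8_t(uint8_t(Old >> Shift) + Val);
    New = (Old & ~Mask) | (uint32_t(NewByte) << Shift);
  } while (!Word->compare_exchange_weak(Old, New, std::memory_order_seq_cst,
                                        std::memory_order_relaxed));
  return uint8_t(Old >> Shift); // original value of the byte
}
```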

asb added a comment (edited). May 31 2018, 8:25 AM

On other architectures which have 32-bit atomics, but not 8 and 16-bit atomics, it's generally possible to lower smaller operations using a 32-bit cmpxchg loop. Does that not work on RISCV for some reason?

We can absolutely lower to ll+sc, which is what the 8 and 16-bit libatomic implementations do for rv32ia. My thought was simply:

  1. Simply generating the libcall is easy and correct. This gives a stepping stone allowing a straightforward initial implementation that can then be improved by switching to generating inline instruction sequences.
  2. If we are happy that the 8 and 16-bit __atomic_* libcalls are lock-free, it is safe to choose between inline instruction sequences and libcalls freely. In that case, a hook such as this would allow you to use the libcall when optimising for code size.

I could understand an argument that use-case 1) isn't compelling enough to add this new hook, and determining whether use-case 2) makes sense is dependent on the conclusion of the GCC bug 86005 discussion. Thanks @jyknight for sharing your thoughts in that GCC thread - much appreciated.

jfb added a comment. May 31 2018, 9:02 AM
In D47553#1117763, @asb wrote:

On other architectures which have 32-bit atomics, but not 8 and 16-bit atomics, it's generally possible to lower smaller operations using a 32-bit cmpxchg loop. Does that not work on RISCV for some reason?

We can absolutely lower to ll+sc, which is what the 8 and 16-bit libatomic implementations do for rv32ia. My thought was simply:

  1. Simply generating the libcall is easy and correct. This gives a stepping stone allowing a straightforward initial implementation that can then be improved by switching to generating inline instruction sequences.
  2. If we are happy that the 8 and 16-bit __atomic_* libcalls are lock-free, it is safe to choose between inline instruction sequences and libcalls freely. In that case, a hook such as this would allow you to use the libcall when optimising for code size.

I could understand an argument that use-case 1) isn't compelling enough to add this new hook, and determining whether use-case 2) makes sense is dependent on the conclusion of the GCC bug 86005 discussion. Thanks @jyknight for sharing your thoughts in that GCC thread - much appreciated.

I think you want the ll/sc code for smaller atomics; it's an assumption on many platforms that you have lock-free 8/16/32-bit atomics, and often double-pointer-wide atomics too. I don't think your libcalls will generally be lock-free. I'd be worried if, even for bringup, your platform at one point in time didn't have that property, because then it sets a precedent for people saying "oh but this one time RISCV didn't do *blah*" and then we'll have endless debates about silly platforms. Please spare me the debates 😉

jfb added a comment. May 31 2018, 9:07 AM
In D47553#1117633, @asb wrote:

Updated patch to document that shouldExpandAtomicToLibCall should return the same result for objects of the same size.

@jfb: I'm simply replicating the existing logic with regards to checking alignment - no functional change is intended.

I see! Can you refer to the LangRef, which says this is UB, and say you're trying your best to not be a jerk about it? Or something like that :-)

asb added a comment. May 31 2018, 12:45 PM
In D47553#1117786, @jfb wrote:
In D47553#1117763, @asb wrote:

I could understand an argument that use-case 1) isn't compelling enough to add this new hook, and determining whether use-case 2) makes sense is dependent on the conclusion of the GCC bug 86005 discussion. Thanks @jyknight for sharing your thoughts in that GCC thread - much appreciated.

I think you want the ll/sc code for smaller atomics; it's an assumption on many platforms that you have lock-free 8/16/32-bit atomics, and often double-pointer-wide atomics too. I don't think your libcalls will generally be lock-free. I'd be worried if, even for bringup, your platform at one point in time didn't have that property, because then it sets a precedent for people saying "oh but this one time RISCV didn't do *blah*" and then we'll have endless debates about silly platforms. Please spare me the debates 😉

I have a lot of sympathy for @jyknight's suggestion on the GCC bug thread - that if RISC-V wants to offer different guarantees for atomic libcalls vs other platforms it should use a different name for them. I haven't worked with these libcalls before and haven't followed the history, so I lack the credentials to form a strong opinion. If you have a strong feeling one way or the other, please do speak up in the GCC bug thread.

If GCC is just doing the wrong thing then obviously we shouldn't blindly copy. However, _if_ it is reasonable for RISC-V to specify a platform/ABI constraint that __atomic_* libcalls <= native register size are lock-free for -march=rv32ia, then it would be a shame if LLVM were unable to exploit that fact.

__atomic_* should be lock-free for i8/i16/i32 on rv32ia; that's not really a special requirement, since it's necessary to allow linking rv32i and rv32ia code.

But even if they are lock-free, it still requires linking to the shared library libatomic; we don't want to impose that requirement on users if it isn't necessary.

asb added a comment. May 31 2018, 1:03 PM

But even if they are lock-free, it still requires linking to the shared library libatomic; we don't want to impose that requirement on users if it isn't necessary.

Excellent point. That's definitely enough motivation to avoid calling the __atomic_* libcalls when not strictly necessary. I'll rework my patches to lower to ll/sc for small atomics on RV32IA from the start. Of course we then run into the problem encountered by Mips - the potential for a stack spill to be inserted between the ll/sc pair and break things.

In D47553#1118083, @asb wrote:

But even if they are lock-free, it still requires linking to the shared library libatomic; we don't want to impose that requirement on users if it isn't necessary.

Excellent point. That's definitely enough motivation to avoid calling the __atomic_* libcalls when not strictly necessary. I'll rework my patches to lower to ll/sc for small atomics on RV32IA from the start. Of course we then run into the problem encountered by Mips - the potential for a stack spill to be inserted between the ll/sc pair and break things.

If you haven't seen it, you might want to read the thread I started a while ago, http://lists.llvm.org/pipermail/llvm-dev/2016-May/099490.html discussing how AtomicExpandPass's feature of generating LL/SC loops at the IR level is a bad idea, and we need a better mechanism in order for it to generate actually-correct code. (We really ought to not expose ll and sc primitives to the IR level at all, because it's impossible to use them correctly).

That fundamental issue hasn't actually been resolved yet -- the current behavior seems to "work well enough" usually. But I don't know whether the RISCV hardware requires strict adherence to its spec that the LL/SC sequence contain at most 16 base-ISA instructions, excluding loads, stores, or taken branches. If so, the generic code in AtomicExpandPass cannot make that guarantee.

asb added a comment. Jun 1 2018, 10:06 AM

If you haven't seen it, you might want to read the thread I started a while ago, http://lists.llvm.org/pipermail/llvm-dev/2016-May/099490.html discussing how AtomicExpandPass's feature of generating LL/SC loops at the IR level is a bad idea, and we need a better mechanism in order for it to generate actually-correct code. (We really ought to not expose ll and sc primitives to the IR level at all, because it's impossible to use them correctly).

That fundamental issue hasn't actually been resolved yet -- the current behavior seems to "work well enough" usually. But I don't know whether the RISCV hardware requires strict adherence to its spec that the LL/SC sequence contain at most 16 base-ISA instructions, excluding loads, stores, or taken branches. If so, the generic code in AtomicExpandPass cannot make that guarantee.

Thanks James, yes, I recently re-read that post and had a good look at what ARM, AArch64 and Mips are doing here. Mips will move to lowering in a post-RA pseudo-expansion pass with D31287, similar to the ARM/AArch64 -O0 approach.

My current implementation avoids the problems you outline by not using the shouldExpandAtomic*InIR hooks; I'm currently going the post-RA pseudo-lowering route. That still leaves a concern about MachineBlockPlacement, but this shouldn't be a realistic concern if you set static branch probabilities appropriately. It would be nice to not have this inter-dependency of course, but it's probably not the greatest evil. I haven't properly examined the potential for the MachineOutliner to break load-linked/store-conditional pairs.

It's somewhat tempting to either emit inline assembly or to expand the pseudo MachineInstruction only when lowering to MCInst.

MachineOutliner doesn't do anything unless target-specific hooks say a transform is safe; it should be possible to guard against the possibility of outlining an ll/sc pair.

asb added a comment. Jun 1 2018, 1:13 PM

MachineOutliner doesn't do anything unless target-specific hooks say a transform is safe; it should be possible to guard against the possibility of outlining an ll/sc pair.

I think I can disable outlining for an entire function using isFunctionSafeToOutlineFrom, but otherwise getOutliningCandidateInfo lets you opt out on a per-instruction basis. I'd be concerned about some of the integer instructions between lr/sc being outlined, unless isFunctionSafeToOutlineFrom is used to disable outlining for any function that contains atomics.
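
A minimal sketch of that conservative option, assuming a hypothetical isAtomicLLSCPseudo predicate and omitting the surrounding backend boilerplate (this is not the actual RISC-V backend code):

```cpp
// Sketch only: refuse to outline from any function containing an LL/SC-based
// atomic pseudo, so the outliner can never split the instructions between
// the LR and the SC. isAtomicLLSCPseudo is a hypothetical helper.
bool RISCVInstrInfo::isFunctionSafeToOutlineFrom(
    MachineFunction &MF, bool OutlineFromLinkOnceODRs) const {
  for (const MachineBasicBlock &MBB : MF)
    for (const MachineInstr &MI : MBB)
      if (isAtomicLLSCPseudo(MI.getOpcode()))
        return false; // conservatively disable outlining for this function
  return true; // otherwise defer to the generic checks
}
```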

MachineOutliner will probably become more flexible soon; see https://bugs.llvm.org/show_bug.cgi?id=37573 .

In D47553#1119498, @asb wrote:

I think I can disable outlining for an entire function using isFunctionSafeToOutlineFrom

Yeah, that'll work. It's conservative, but it will ensure that the outliner will never touch the function, and thus, won't touch the instructions (or anything between them).

but otherwise getOutliningCandidateInfo lets you opt out on a per-instruction basis. I'd be concerned about some of the integer instructions between lr/sc being outlined, unless isFunctionSafeToOutlineFrom is used to disable outlining for any function that contains atomics.

After D47654 has landed, it should be possible to use a technique similar to D47655? That patch (D47654) makes it possible to discard candidates based on target-specific information (e.g., register liveness, etc.)

If some sort of register liveness information is sufficient, I could probably just drop it into D47655...

Thank you everyone for your comments. I've removed the dependency on this patch from my queue of patches, and will instead implement 8/16/32-bit atomics in one go.

asb abandoned this revision. Sep 18 2018, 8:21 AM

Re-abandoning.