
Add TargetLowering::shouldExpandAtomicToLibCall and query it from AtomicExpandPass
Abandoned · Public

Authored by asb on May 30 2018, 1:07 PM.

Details

Summary

Currently, AtomicExpandPass will expand atomic instructions to an __atomic_*
libcall if the size is greater than MaxSizeInBitsSupported or the
alignment is less than the value's natural alignment. The only way a target
can influence this is by changing MaxSizeInBitsSupported, but this is a very
blunt instrument. A target might, for instance, want to always generate
libcalls for atomics smaller than the native word size but generate inline
lock-free code for word-sized atomics. If the library implementation is known
to be lock-free, the target has even greater flexibility, such as choosing to
emit a libcall or not depending on whether it is optimising for size.

The new TargetLowering::shouldExpandAtomicToLibCall makes this degree of
flexibility possible. The default implementation makes use of
MaxSizeInBitsSupported so there is no functional change for in-tree or
out-of-tree targets that don't implement their own
shouldExpandAtomicToLibCall.

Diff Detail

Event Timeline

asb created this revision.May 30 2018, 1:07 PM

Your suggested lowering is illegal according to the API rules for libatomic; for any given size/alignment, either all operations are lock-free, or all operations use locks. Otherwise, atomics aren't threadsafe. See also https://gcc.gnu.org/wiki/Atomic/GCCMM/LIbrary.

ARM Linux gets around this issue by lowering atomic cmpxchg to call __sync_val_compare_and_swap_4 etc. instead; the implementation uses a special lock-free helper provided by the kernel. See https://www.kernel.org/doc/Documentation/arm/kernel_user_helpers.txt .

jfb added a comment.May 30 2018, 1:32 PM

Could you detail what type of code will generate this? In C++ the standard library is responsible for guaranteeing properly sized and aligned atomics. Size will always be accurate, but I'm worried that your change wants perfect alignment information and you might not have that information. Say two pieces of code communicate through the same atomic location, but they have different alignment information, and one gets a libcall and the other doesn't... You're in trouble.

asb added a comment.May 30 2018, 1:33 PM

Thanks Eli, I actually just came to the same conclusion when taking a step back and thinking about it some more. I was caught up on matching RISC-V GCC behaviour in this case, when in fact it seems RISC-V GCC is broken here. The LLVM docs on atomics also document this requirement: https://llvm.org/docs/Atomics.html#atomics-and-codegen. The 'blunt instrument' of MaxSizeInBitsSupported is acting as designed.

asb abandoned this revision.May 30 2018, 2:22 PM

Thanks Eli and JF for the rapid feedback. I was too quick to try to replicate GCC behaviour without sanity-checking its output.

I've filed a GCC bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86005 and am marking this revision abandoned.

asb added a subscriber: theraven.May 31 2018, 6:52 AM

I'm going to un-abandon this patch. This hook is still helpful in order to e.g. lower to __atomic libcalls for all 8 and 16-bit operations but not 32-bit. This seems acceptable based on the linked GCC and LLVM docs, but isn't possible with the current MaxSizeInBitsSupported. Am I missing any problems with that strategy?

It seems GCC RISC-V will emit __atomic_* libcalls for 8 and 16-bit operations other than load/store even with -march=rv32ia. The immediate response in the linked GCC bug report is that perhaps that's fine if the RISC-V version of these sub-word atomics is guaranteed to be lock-free (which apparently is the case for libatomic compiled for rv32ia). If this is confirmed, it would give a case where intermixing is safe. I don't know how reasonable it is to declare that __atomic_* libcalls are lock-free if compiled with support for the A extension - if you have views please do jump on the GCC bug report thread.

asb updated this revision to Diff 149284.May 31 2018, 6:54 AM
asb edited the summary of this revision. (Show Details)

Updated patch to document that shouldExpandAtomicToLibCall should return the same result for objects of the same size.

@jfb: I'm simply replicating the existing logic with regards to checking alignment - no functional change is intended.

In D47553#1117630, @asb wrote:

I'm going to un-abandon this patch. This hook is still helpful in order to e.g. lower to __atomic libcalls for all 8 and 16-bit operations but not 32-bit. This seems acceptable based on the linked GCC and LLVM docs, but isn't possible with the current MaxSizeInBitsSupported. Am I missing any problems with that strategy?

It seems GCC RISC-V will emit __atomic_* libcalls for 8 and 16-bit operations other than load/store even with -march=rv32ia. The immediate response in the linked GCC bug report is that perhaps that's fine if the RISC-V version of these sub-word atomics is guaranteed to be lock-free (which apparently is the case for libatomic compiled for rv32ia). If this is confirmed, it would give a case where intermixing is safe. I don't know how reasonable it is to declare that __atomic_* libcalls are lock-free if compiled with support for the A extension - if you have views please do jump on the GCC bug report thread.

No...while it shouldn't be _broken_, it makes no sense to do so. If you can do 32-bit operations atomically, you can of course do the same for 8 and 16.

On other architectures which have 32-bit atomics, but not 8 and 16-bit atomics, it's generally possible to lower smaller operations using a 32-bit cmpxchg loop. Does that not work on RISCV for some reason?

asb added a comment.EditedMay 31 2018, 8:25 AM

On other architectures which have 32-bit atomics, but not 8 and 16-bit atomics, it's generally possible to lower smaller operations using a 32-bit cmpxchg loop. Does that not work on RISCV for some reason?

We can absolutely lower to ll+sc, which is what the 8 and 16-bit libatomic implementations do for rv32ia. My thought was simply:

  1. Simply generating the libcall is easy and correct. This gives a stepping stone allowing a straightforward implementation of initial support that can then be improved by switching to generating inline instruction sequences.
  2. If we are happy that the 8 and 16-bit __atomic_* libcalls are lock-free, it is safe to choose between inline instruction sequences and libcalls freely. In that case, a hook such as this would allow you to use the libcall when optimising for code size.

I could understand an argument that use-case 1) isn't compelling enough to add this new hook, and determining whether use-case 2) makes sense is dependent on the conclusion of the GCC bug 86005 discussion. Thanks @jyknight for sharing your thoughts in that GCC thread - much appreciated.

jfb added a comment.May 31 2018, 9:02 AM
In D47553#1117763, @asb wrote:

On other architectures which have 32-bit atomics, but not 8 and 16-bit atomics, it's generally possible to lower smaller operations using a 32-bit cmpxchg loop. Does that not work on RISCV for some reason?

We can absolutely lower to ll+sc, which is what the 8 and 16-bit libatomic implementations do for rv32ia. My thought was simply:

  1. Simply generating the libcall is easy and correct. This gives a stepping stone allowing a straightforward implementation of initial support that can then be improved by switching to generating inline instruction sequences.
  2. If we are happy that the 8 and 16-bit __atomic_* libcalls are lock-free, it is safe to choose between inline instruction sequences and libcalls freely. In that case, a hook such as this would allow you to use the libcall when optimising for code size.

I could understand an argument that use-case 1) isn't compelling enough to add this new hook, and determining whether use-case 2) makes sense is dependent on the conclusion of the GCC bug 86005 discussion. Thanks @jyknight for sharing your thoughts in that GCC thread - much appreciated.

I think you want the ll/sc code for smaller atomics; it's an assumption on many platforms that you have lock-free 8-, 16-, and 32-bit atomics, and often double-pointer-wide atomics too. I don't think your libcalls will generally be lock-free. I'd be worried if, even for bringup, your platform at one point in time didn't have that property, because then it sets a precedent for people saying "oh but this one time RISCV didn't do *blah*" and then we'll have endless debates about silly platforms. Please avoid me the debates ๐Ÿ˜‰

jfb added a comment.May 31 2018, 9:07 AM
In D47553#1117633, @asb wrote:

Updated patch to document that shouldExpandAtomicToLibCall should return the same result for objects of the same size.

@jfb: I'm simply replicating the existing logic with regards to checking alignment - no functional change is intended.

I see! Can you refer to the langref, which says this is UB, and say you're trying your best to not be a jerk about it? Or something like that :-)

asb added a comment.May 31 2018, 12:45 PM
In D47553#1117786, @jfb wrote:
In D47553#1117763, @asb wrote:

I could understand an argument that use-case 1) isn't compelling enough to add this new hook, and determining whether use-case 2) makes sense is dependent on the conclusion of the GCC bug 86005 discussion. Thanks @jyknight for sharing your thoughts in that GCC thread - much appreciated.

I think you want the ll/sc code for smaller atomics; it's an assumption on many platforms that you have lock-free 8-, 16-, and 32-bit atomics, and often double-pointer-wide atomics too. I don't think your libcalls will generally be lock-free. I'd be worried if, even for bringup, your platform at one point in time didn't have that property, because then it sets a precedent for people saying "oh but this one time RISCV didn't do *blah*" and then we'll have endless debates about silly platforms. Please avoid me the debates ๐Ÿ˜‰

I have a lot of sympathy for @jyknight's suggestion on the GCC bug thread - that if RISC-V wants to offer different guarantees for atomic libcalls vs other platforms it should use a different name for them. I haven't worked with these libcalls before and haven't followed the history, so I lack the credentials to form a strong opinion. If you have a strong feeling one way or the other, please do speak up in the GCC bug thread.

If GCC is just doing the wrong thing then obviously we shouldn't blindly copy. However, _if_ it is reasonable for RISC-V to specify a platform/ABI constraint that __atomic_* libcalls <= native register size are lock-free for -march=rv32ia, then it would be a shame if LLVM were unable to exploit that fact.

__atomic_* should be lock-free for i8/i16/i32 on rv32ia; that's not really a special requirement, it's necessary to allow linking rv32i and rv32ia code.

But even if they are lock-free, it still requires linking to the shared library libatomic; we don't want to impose that requirement on users if it isn't necessary.

asb added a comment.May 31 2018, 1:03 PM

But even if they are lock-free, it still requires linking to the shared library libatomic; we don't want to impose that requirement on users if it isn't necessary.

Excellent point. That's definitely enough motivation to avoid calling the __atomic_* libcalls when not strictly necessary. I'll rework my patches to lower to ll/sc for small atomics on RV32IA from the start. Of course, we then run into the problem encountered by Mips - the potential for a stack spill to be inserted between the ll/sc pair, breaking things.

In D47553#1118083, @asb wrote:

But even if they are lock-free, it still requires linking to the shared library libatomic; we don't want to impose that requirement on users if it isn't necessary.

Excellent point. That's definitely enough motivation to avoid calling the __atomic_* libcalls when not strictly necessary. I'll rework my patches to lower to ll/sc for small atomics on RV32IA from the start. Of course, we then run into the problem encountered by Mips - the potential for a stack spill to be inserted between the ll/sc pair, breaking things.

If you haven't seen it, you might want to read the thread I started a while ago, http://lists.llvm.org/pipermail/llvm-dev/2016-May/099490.html discussing how AtomicExpandPass's feature of generating LL/SC loops at the IR level is a bad idea, and we need a better mechanism in order for it to generate actually-correct code. (We really ought to not expose ll and sc primitives to the IR level at all, because it's impossible to use them correctly).

That fundamental issue hasn't actually been resolved yet -- the current behavior seems to "work well enough" usually. But I don't know whether the RISCV hardware requires strict adherence to its spec of at most 16 base-ISA instructions between LR and SC, excluding loads, stores, and taken branches. If so, the generic code in AtomicExpandPass cannot make that guarantee.

asb added a comment.Jun 1 2018, 10:06 AM

If you haven't seen it, you might want to read the thread I started a while ago, http://lists.llvm.org/pipermail/llvm-dev/2016-May/099490.html discussing how AtomicExpandPass's feature of generating LL/SC loops at the IR level is a bad idea, and we need a better mechanism in order for it to generate actually-correct code. (We really ought to not expose ll and sc primitives to the IR level at all, because it's impossible to use them correctly).

That fundamental issue hasn't actually been resolved yet -- the current behavior seems to "work well enough" usually. But I don't know whether the RISCV hardware requires strict adherence to its spec of at most 16 base-ISA instructions between LR and SC, excluding loads, stores, and taken branches. If so, the generic code in AtomicExpandPass cannot make that guarantee.

Thanks James, yes I recently re-read that post and had a good look at what ARM, AArch64 and Mips are doing here. Mips will move to lowering in a post-RA pseudo-expansion pass with D31287, similar to the ARM/AArch64 -O0 approach.

My current implementation is not using the shouldExpandAtomic*InIR hooks to avoid the problems you outline. I'm going the post-RA pseudo-lowering route currently. That still leaves a concern about MachineBlockPlacement, but this shouldn't be a realistic concern if you set static branch probabilities appropriately. It would be nice to not have this inter-dependency of course, but it's probably not the greatest evil. I haven't properly examined the potential for the MachineOutliner to break load-linked/store-conditional pairs.

It's somewhat tempting to either emit inline assembly or to expand the pseudo MachineInstruction only when lowering to MCInst.

MachineOutliner doesn't do anything unless target-specific hooks say a transform is safe; it should be possible to guard against the possibility of outlining an ll/sc pair.

asb added a comment.Jun 1 2018, 1:13 PM

MachineOutliner doesn't do anything unless target-specific hooks say a transform is safe; it should be possible to guard against the possibility of outlining an ll/sc pair.

I think I can disable outlining for an entire function using isFunctionSafeToOutlineFrom, but otherwise getOutliningCandidateInfo lets you opt-out on a per-instruction basis. I'd be concerned about some of the integer instructions between lr/sc being outlined, unless isFunctionSafeToOutlineFrom is used to disable outlining for any function that contains atomics.

MachineOutliner will probably become more flexible soon; see https://bugs.llvm.org/show_bug.cgi?id=37573 .

In D47553#1119498, @asb wrote:

I think I can disable outlining for an entire function using isFunctionSafeToOutlineFrom

Yeah, that'll work. It's conservative, but it will ensure that the outliner will never touch the function, and thus, won't touch the instructions (or anything between them).

but otherwise getOutliningCandidateInfo lets you opt-out on a per-instruction basis. I'd be concerned about some of the integer instructions between lr/sc being outlined, unless isFunctionSafeToOutlineFrom is used to disable outlining for any function that contains atomics.

After D47654 has landed, it should be possible to use a technique similar to D47655? That patch (D47654) makes it possible to discard candidates based off of target-specific information (e.g, register liveness, etc.)

If some sort of register liveness information is sufficient, I could probably just drop it into D47655...

Thank you everyone for your comments. I've removed the dependency on this patch from my queue of patches, and will instead implement 8/16/32-bit atomics in one go.

asb abandoned this revision.Sep 18 2018, 8:21 AM

Re-abandoning.