Index: docs/Atomics.rst
===================================================================
--- docs/Atomics.rst
+++ docs/Atomics.rst
@@ -417,14 +417,23 @@
 this is not correct in the IR sense of volatile, but CodeGen handles anything
 marked volatile very conservatively. This should get fixed at some point.
 
-Common architectures have some way of representing at least a pointer-sized
-lock-free ``cmpxchg``; such an operation can be used to implement all the other
-atomic operations which can be represented in IR up to that size. Backends are
-expected to implement all those operations, but not operations which cannot be
-implemented in a lock-free manner. It is expected that backends will give an
-error when given an operation which cannot be implemented. (The LLVM code
-generator is not very helpful here at the moment, but hopefully that will
-change.)
+One very important property of the atomic operations is that if your backend
+supports any inline lock-free atomic operations of a given size, you should
+support *ALL* operations of that size in a lock-free manner.
+
+When the target implements atomic cmpxchg or LL/SC instructions (as most do),
+this is trivial: all the other operations can be implemented on top of those
+primitives. However, on many older CPUs (e.g. ARMv5, SparcV8, Intel 80386) there
+are atomic load and store instructions, but no cmpxchg or LL/SC. As it is
+invalid to implement ``atomic load`` using the native instruction, but
+``cmpxchg`` using a library call to a function that uses a mutex, ``atomic
+load`` must *also* expand to a library call on such architectures, so that it
+can remain atomic with regard to a simultaneous ``cmpxchg``, by using the same
+mutex.
+
+AtomicExpandPass can help with that: it will expand atomic operations of any
+size above the maximum set by ``setMaxAtomicSizeSupported`` (which defaults to
+0) into the proper ``__atomic_*`` libcalls.
 
 On x86, all atomic loads generate a ``MOV``. SequentiallyConsistent stores
 generate an ``XCHG``, other stores generate a ``MOV``. SequentiallyConsistent
@@ -450,10 +459,149 @@
   ``emitStoreConditional()``
 * large loads/stores -> ll-sc/cmpxchg
   by overriding ``shouldExpandAtomicStoreInIR()``/``shouldExpandAtomicLoadInIR()``
-* strong atomic accesses -> monotonic accesses + fences
-  by using ``setInsertFencesForAtomic()`` and overriding ``emitLeadingFence()``
-  and ``emitTrailingFence()``
+* strong atomic accesses -> monotonic accesses + fences by overriding
+  ``shouldInsertFencesForAtomic()``, ``emitLeadingFence()``, and
+  ``emitTrailingFence()``
 * atomic rmw -> loop with cmpxchg or load-linked/store-conditional
   by overriding ``expandAtomicRMWInIR()``
+* expansion to __atomic_* libcalls for unsupported sizes (see the sketch below).
 
 For an example of all of these, look at the ARM backend.
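+
+As a rough sketch of how a backend opts into these hooks (the target name
+``Foo``, the ``FooSubtarget`` class, and its ``hasLockFreeAtomics()`` predicate
+are invented for illustration only; the ARM and SPARC changes in this patch are
+the authoritative examples)::
+
+  // FooISelLowering.cpp -- hypothetical backend configuration.
+  FooTargetLowering::FooTargetLowering(const TargetMachine &TM,
+                                       const FooSubtarget &STI)
+      : TargetLowering(TM), Subtarget(STI) {
+    // Only 32-bit and smaller accesses have inline lock-free lowerings on
+    // this imaginary target; anything wider is expanded by AtomicExpandPass
+    // into __atomic_* libcalls.
+    setMaxAtomicSizeSupported(Subtarget.hasLockFreeAtomics() ? 32 : 0);
+  }
+
+  // Weakly-ordered target: ask AtomicExpandPass to reduce orderings to
+  // monotonic and bracket each access with emitLeadingFence() /
+  // emitTrailingFence().
+  bool FooTargetLowering::shouldInsertFencesForAtomic(
+      const Instruction *I) const {
+    return true;
+  }
+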
+Libcalls: __atomic_*
+====================
+
+There are two kinds of atomic library calls that are generated by LLVM. Please
+note that both sets of library functions somewhat confusingly share the names of
+builtin functions defined by clang. Despite this, the library functions are
+not directly related to the builtins: it is *not* the case that ``__atomic_*``
+builtins lower to ``__atomic_*`` library calls and ``__sync_*`` builtins lower
+to ``__sync_*`` library calls.
+
+The first set of library functions are named ``__atomic_*``. This set has been
+"standardized" by GCC, and is described below. (See also `GCC's documentation`_.)
+
+LLVM's AtomicExpandPass will translate atomic operations on data sizes above
+``MaxAtomicSizeSupported`` into calls to these functions.
+
+There are four generic functions, which can be called with data of any size or
+alignment::
+
+  void __atomic_load(size_t size, void *ptr, void *ret, int ordering)
+  void __atomic_store(size_t size, void *ptr, void *val, int ordering)
+  void __atomic_exchange(size_t size, void *ptr, void *val, void *ret, int ordering)
+  bool __atomic_compare_exchange(size_t size, void *ptr, void *expected, void *desired, int success_order, int failure_order)
+
+There are also size-specialized versions of the above functions, which can only
+be used with *naturally-aligned* pointers of the appropriate size. In the
+signatures below, "N" is one of 1, 2, 4, 8, and 16, and "iN" is the appropriate
+integer type of that size; if no such integer type exists, the specialization
+cannot be used::
+
+  iN __atomic_load_N(iN *ptr, int ordering)
+  void __atomic_store_N(iN *ptr, iN val, int ordering)
+  iN __atomic_exchange_N(iN *ptr, iN val, int ordering)
+  bool __atomic_compare_exchange_N(iN *ptr, iN *expected, iN desired, int success_order, int failure_order)
+
+Finally there are some read-modify-write functions, which are only available in
+the size-specific variants (any other sizes use a ``__atomic_compare_exchange``
+loop)::
+
+  iN __atomic_fetch_add_N(iN *ptr, iN val, int ordering)
+  iN __atomic_fetch_sub_N(iN *ptr, iN val, int ordering)
+  iN __atomic_fetch_and_N(iN *ptr, iN val, int ordering)
+  iN __atomic_fetch_or_N(iN *ptr, iN val, int ordering)
+  iN __atomic_fetch_xor_N(iN *ptr, iN val, int ordering)
+  iN __atomic_fetch_nand_N(iN *ptr, iN val, int ordering)
+
+This set of library functions has some interesting implementation requirements
+to take note of:
+
+- They support all sizes and alignments -- including those which cannot be
+  implemented natively on any existing hardware. Therefore, they will certainly
+  use mutexes for some sizes/alignments.
+
+- As a consequence, they cannot be shipped in a statically linked
+  compiler-support library, as they have state which must be shared amongst all
+  DSOs loaded in the program. They must be provided in a shared library used by
+  all objects.
+
+- The set of atomic sizes supported lock-free must be a superset of the sizes
+  any compiler can emit. That is: if a new compiler introduces support for
+  inline-lock-free atomics of size N, the ``__atomic_*`` functions must also have
+  a lock-free implementation for size N. This is a requirement so that code
+  produced by an old compiler (which will have called the ``__atomic_*``
+  function) interoperates with code produced by the new compiler (which will use
+  the native atomic instructions).
+
+Note that it's possible to write an entirely target-independent implementation
+of these library functions by using the compiler atomic builtins themselves to
+implement the operations on naturally-aligned pointers of supported sizes, and a
+generic mutex implementation otherwise.
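+
+As a deliberately simplistic sketch of that approach, the generic
+``__atomic_load`` might look as follows, assuming a single global lock rather
+than the address-hashed lock table a production library would use (note that
+building such a file needs care, e.g. ``-fno-builtin``, since these names are
+also compiler builtins; everything here other than the ``__atomic_load``
+signature itself is illustrative)::
+
+  #include <cstddef>
+  #include <cstring>
+  #include <mutex>
+
+  // One lock shared by every locked path in the library, so that an
+  // oversized atomic load stays atomic with respect to an oversized
+  // cmpxchg on the same object.
+  static std::mutex AtomicLock;
+
+  extern "C" void __atomic_load(std::size_t size, void *ptr, void *ret,
+                                int /*ordering*/) {
+    // Generic entry point: only reached for sizes/alignments the compiler
+    // could not inline, so take the lock and copy.
+    std::lock_guard<std::mutex> Guard(AtomicLock);
+    std::memcpy(ret, ptr, size);
+  }
+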
+Libcalls: __sync_*
+==================
+
+Some targets or OS/target combinations can support lock-free atomics, but for
+various reasons, it is not practical to emit the instructions inline.
+
+There are two typical examples of this.
+
+Some CPUs support multiple instruction sets which can be switched back and forth
+on function-call boundaries. For example, MIPS supports the MIPS16 ISA, which
+has a smaller instruction encoding than the usual MIPS32 ISA. ARM, similarly,
+has the Thumb ISA. In MIPS16 and earlier versions of Thumb, the atomic
+instructions are not encodable. However, those instructions are available by
+calling a function compiled with the longer encoding.
+
+Additionally, a few OS/target pairs provide kernel-supported lock-free
+atomics. ARM/Linux is an example of this: the kernel provides a function which
+on older CPUs contains a "magically-restartable" atomic sequence (which looks
+atomic so long as there's only one CPU), and contains actual atomic instructions
+on newer multicore models. This sort of functionality can typically be provided
+on any architecture, if all CPUs which are missing cmpxchg support are
+uniprocessor (no SMP). This is almost always the case. The only common
+architecture without that property is SPARC -- SPARCV8 SMP systems were common,
+yet it doesn't support cmpxchg.
+
+In either of these cases, the Target in LLVM can claim support for atomics of an
+appropriate size, and then implement some subset of the operations via libcalls
+to a ``__sync_*`` function. Such functions *must* not use locks in their
+implementation, because unlike the ``__atomic_*`` routines used by
+AtomicExpandPass, these may be mixed-and-matched with native instructions by the
+target lowering.
+
+Further, these routines do not need to be shared, as they are stateless. So,
+there is no issue with having multiple copies included in one binary. Thus,
+typically these routines are implemented by the statically-linked compiler
+runtime support library.
+
+LLVM will emit a call to an appropriate ``__sync_*`` routine if the target
+ISelLowering code has set the corresponding ``ATOMIC_CMPXCHG``, ``ATOMIC_SWAP``,
+or ``ATOMIC_LOAD_*`` operation to "Expand", and if it has opted into the
+availability of those library functions via a call to ``initSyncLibcalls()``.
+
+The full set of functions that may be called by LLVM is (for ``N`` being 1, 2,
+4, 8, or 16)::
+
+  iN __sync_val_compare_and_swap_N(iN *ptr, iN expected, iN desired)
+  iN __sync_lock_test_and_set_N(iN *ptr, iN val)
+  iN __sync_fetch_and_add_N(iN *ptr, iN val)
+  iN __sync_fetch_and_sub_N(iN *ptr, iN val)
+  iN __sync_fetch_and_and_N(iN *ptr, iN val)
+  iN __sync_fetch_and_or_N(iN *ptr, iN val)
+  iN __sync_fetch_and_xor_N(iN *ptr, iN val)
+  iN __sync_fetch_and_nand_N(iN *ptr, iN val)
+  iN __sync_fetch_and_max_N(iN *ptr, iN val)
+  iN __sync_fetch_and_umax_N(iN *ptr, iN val)
+  iN __sync_fetch_and_min_N(iN *ptr, iN val)
+  iN __sync_fetch_and_umin_N(iN *ptr, iN val)
+
+This list doesn't include any function for atomic load or store; all known
+architectures support atomic loads and stores directly (possibly by emitting a
+fence on either side of a normal load or store).
+
+There's also, somewhat separately, the possibility to lower ``ATOMIC_FENCE`` to
+``__sync_synchronize()``. Whether this happens is independent of all of the
+above, and is controlled purely by ``setOperationAction(ISD::ATOMIC_FENCE, ...)``.
Index: include/llvm/CodeGen/RuntimeLibcalls.h
===================================================================
--- include/llvm/CodeGen/RuntimeLibcalls.h
+++ include/llvm/CodeGen/RuntimeLibcalls.h
@@ -398,6 +398,73 @@
     SYNC_FETCH_AND_UMIN_8,
     SYNC_FETCH_AND_UMIN_16,
 
+    // New style atomics.
+ ATOMIC_LOAD, + ATOMIC_LOAD_1, + ATOMIC_LOAD_2, + ATOMIC_LOAD_4, + ATOMIC_LOAD_8, + ATOMIC_LOAD_16, + + ATOMIC_STORE, + ATOMIC_STORE_1, + ATOMIC_STORE_2, + ATOMIC_STORE_4, + ATOMIC_STORE_8, + ATOMIC_STORE_16, + + ATOMIC_EXCHANGE, + ATOMIC_EXCHANGE_1, + ATOMIC_EXCHANGE_2, + ATOMIC_EXCHANGE_4, + ATOMIC_EXCHANGE_8, + ATOMIC_EXCHANGE_16, + + ATOMIC_COMPARE_EXCHANGE, + ATOMIC_COMPARE_EXCHANGE_1, + ATOMIC_COMPARE_EXCHANGE_2, + ATOMIC_COMPARE_EXCHANGE_4, + ATOMIC_COMPARE_EXCHANGE_8, + ATOMIC_COMPARE_EXCHANGE_16, + + ATOMIC_FETCH_ADD_1, + ATOMIC_FETCH_ADD_2, + ATOMIC_FETCH_ADD_4, + ATOMIC_FETCH_ADD_8, + ATOMIC_FETCH_ADD_16, + + ATOMIC_FETCH_SUB_1, + ATOMIC_FETCH_SUB_2, + ATOMIC_FETCH_SUB_4, + ATOMIC_FETCH_SUB_8, + ATOMIC_FETCH_SUB_16, + + ATOMIC_FETCH_AND_1, + ATOMIC_FETCH_AND_2, + ATOMIC_FETCH_AND_4, + ATOMIC_FETCH_AND_8, + ATOMIC_FETCH_AND_16, + + ATOMIC_FETCH_OR_1, + ATOMIC_FETCH_OR_2, + ATOMIC_FETCH_OR_4, + ATOMIC_FETCH_OR_8, + ATOMIC_FETCH_OR_16, + + ATOMIC_FETCH_XOR_1, + ATOMIC_FETCH_XOR_2, + ATOMIC_FETCH_XOR_4, + ATOMIC_FETCH_XOR_8, + ATOMIC_FETCH_XOR_16, + + ATOMIC_FETCH_NAND_1, + ATOMIC_FETCH_NAND_2, + ATOMIC_FETCH_NAND_4, + ATOMIC_FETCH_NAND_8, + ATOMIC_FETCH_NAND_16, + + ATOMIC_IS_LOCK_FREE, + // Stack Protector Fail. STACKPROTECTOR_CHECK_FAIL, @@ -430,7 +497,7 @@ /// Return the SYNC_FETCH_AND_* value for the given opcode and type, or /// UNKNOWN_LIBCALL if there is none. - Libcall getATOMIC(unsigned Opc, MVT VT); + Libcall getSYNC(unsigned Opc, MVT VT); } } Index: include/llvm/Target/TargetLowering.h =================================================================== --- include/llvm/Target/TargetLowering.h +++ include/llvm/Target/TargetLowering.h @@ -1003,12 +1003,6 @@ return PrefLoopAlignment; } - /// Return whether the DAG builder should automatically insert fences and - /// reduce ordering for atomics. - bool getInsertFencesForAtomic() const { - return InsertFencesForAtomic; - } - /// Return true if the target stores stack protector cookies at a fixed offset /// in some non-standard address space, and populates the address space and /// offset as appropriate. @@ -1052,6 +1046,19 @@ /// \name Helpers for atomic expansion. /// @{ + /// Returns the maximum atomic operation size (in bits) supported by + /// the backend. Atomic operations greater than this size (as well + /// as ones that are not naturally aligned), will be expanded by + /// AtomicExpandPass into an __atomic_* library call. + unsigned getMaxAtomicSizeSupported() const { return MaxAtomicSizeSupported; } + + /// Whether the DAG builder should automatically insert fences and + /// reduce ordering for this atomic. This should be true for + /// most architectures with weak memory ordering. Defaults to false. + virtual bool shouldInsertFencesForAtomic(const Instruction *I) const { + return false; + } + /// Perform a load-linked operation on Addr, returning a "Value *" with the /// corresponding pointee type. This may entail some non-trivial operations to /// truncate or reconstruct types that will be illegal in the backend. See @@ -1070,12 +1077,12 @@ /// Inserts in the IR a target-specific intrinsic specifying a fence. /// It is called by AtomicExpandPass before expanding an - /// AtomicRMW/AtomicCmpXchg/AtomicStore/AtomicLoad. + /// AtomicRMW/AtomicCmpXchg/AtomicStore/AtomicLoad + /// if shouldInsertFencesForAtomic returns true. /// RMW and CmpXchg set both IsStore and IsLoad to true. /// This function should either return a nullptr, or a pointer to an IR-level /// Instruction*. 
Even complex fence sequences can be represented by a /// single Instruction* through an intrinsic to be lowered later. - /// Backends with !getInsertFencesForAtomic() should keep a no-op here. /// Backends should override this method to produce target-specific intrinsic /// for their fences. /// FIXME: Please note that the default implementation here in terms of @@ -1101,9 +1108,6 @@ virtual Instruction *emitLeadingFence(IRBuilder<> &Builder, AtomicOrdering Ord, bool IsStore, bool IsLoad) const { - if (!getInsertFencesForAtomic()) - return nullptr; - if (isAtLeastRelease(Ord) && IsStore) return Builder.CreateFence(Ord); else @@ -1113,9 +1117,6 @@ virtual Instruction *emitTrailingFence(IRBuilder<> &Builder, AtomicOrdering Ord, bool IsStore, bool IsLoad) const { - if (!getInsertFencesForAtomic()) - return nullptr; - if (isAtLeastAcquire(Ord)) return Builder.CreateFence(Ord); else @@ -1441,10 +1442,12 @@ MinStackArgumentAlignment = Align; } - /// Set if the DAG builder should automatically insert fences and reduce the - /// order of atomic memory operations to Monotonic. - void setInsertFencesForAtomic(bool fence) { - InsertFencesForAtomic = fence; + /// Set the maximum atomic operation size supported by the + /// backend. Atomic operations greater than this size (as well as + /// ones that are not naturally aligned), will be expanded by + /// AtomicExpandPass into an __atomic_* library call. + void setMaxAtomicSizeSupported(unsigned Size) { + MaxAtomicSizeSupported = Size; } public: @@ -1856,10 +1859,9 @@ /// The preferred loop alignment. unsigned PrefLoopAlignment; - /// Whether the DAG builder should automatically insert fences and reduce - /// ordering for atomics. (This will be set for for most architectures with - /// weak memory ordering.) - bool InsertFencesForAtomic; + /// Size in bits of the maximum atomics size the backend supports. + /// Accesses larger than this will be expanded by AtomicExpandPass. + unsigned MaxAtomicSizeSupported; /// If set to a physical register, this specifies the register that /// llvm.savestack/llvm.restorestack should save and restore. Index: lib/CodeGen/AtomicExpandPass.cpp =================================================================== --- lib/CodeGen/AtomicExpandPass.cpp +++ lib/CodeGen/AtomicExpandPass.cpp @@ -8,10 +8,10 @@ //===----------------------------------------------------------------------===// // // This file contains a pass (at IR level) to replace atomic instructions with -// target specific instruction which implement the same semantics in a way -// which better fits the target backend. This can include the use of either -// (intrinsic-based) load-linked/store-conditional loops, AtomicCmpXchg, or -// type coercions. +// __atomic_* library calls, or target specific instruction which implement the +// same semantics in a way which better fits the target backend. This can +// include the use of (intrinsic-based) load-linked/store-conditional loops, +// AtomicCmpXchg, or type coercions. 
// //===----------------------------------------------------------------------===// @@ -64,19 +64,93 @@ bool expandAtomicCmpXchg(AtomicCmpXchgInst *CI); bool isIdempotentRMW(AtomicRMWInst *AI); bool simplifyIdempotentRMW(AtomicRMWInst *AI); + + bool expandAtomicOpToLibcall(Instruction *I, unsigned Size, unsigned Align, + Value *PointerOperand, Value *ValueOperand, + Value *CASExpected, AtomicOrdering Ordering, + AtomicOrdering Ordering2, + const RTLIB::Libcall *Libcalls); + void expandAtomicLoadToLibcall(LoadInst *LI); + void expandAtomicStoreToLibcall(StoreInst *LI); + void expandAtomicRMWToLibcall(AtomicRMWInst *I); + void expandAtomicCASToLibcall(AtomicCmpXchgInst *I); }; } char AtomicExpand::ID = 0; char &llvm::AtomicExpandID = AtomicExpand::ID; -INITIALIZE_TM_PASS(AtomicExpand, "atomic-expand", - "Expand Atomic calls in terms of either load-linked & store-conditional or cmpxchg", - false, false) +INITIALIZE_TM_PASS(AtomicExpand, "atomic-expand", "Expand Atomic instructions", + false, false) FunctionPass *llvm::createAtomicExpandPass(const TargetMachine *TM) { return new AtomicExpand(TM); } +namespace { +// Helper functions to retrieve the size of atomic instructions. +unsigned getAtomicOpSize(LoadInst *LI) { + const DataLayout &DL = LI->getModule()->getDataLayout(); + return DL.getTypeStoreSize(LI->getType()); +} + +unsigned getAtomicOpSize(StoreInst *SI) { + const DataLayout &DL = SI->getModule()->getDataLayout(); + return DL.getTypeStoreSize(SI->getValueOperand()->getType()); +} + +unsigned getAtomicOpSize(AtomicRMWInst *RMWI) { + const DataLayout &DL = RMWI->getModule()->getDataLayout(); + return DL.getTypeStoreSize(RMWI->getValOperand()->getType()); +} + +unsigned getAtomicOpSize(AtomicCmpXchgInst *CASI) { + const DataLayout &DL = CASI->getModule()->getDataLayout(); + return DL.getTypeStoreSize(CASI->getCompareOperand()->getType()); +} + +// Helper functions to retrieve the alignment of atomic instructions. +unsigned getAtomicOpAlign(LoadInst *LI) { + const DataLayout &DL = LI->getModule()->getDataLayout(); + unsigned Align = LI->getAlignment(); + if (Align == 0) + return DL.getABITypeAlignment(LI->getType()); + return Align; +} + +unsigned getAtomicOpAlign(StoreInst *SI) { + const DataLayout &DL = SI->getModule()->getDataLayout(); + unsigned Align = SI->getAlignment(); + if (Align == 0) + return DL.getABITypeAlignment(SI->getValueOperand()->getType()); + return Align; +} + +unsigned getAtomicOpAlign(AtomicRMWInst *RMWI) { + // TODO: This instruction has no alignment attribute, but unlike the + // default alignment for load/store, the default here is to assume + // it has NATURAL alignment, not DataLayout-specified alignment. + const DataLayout &DL = RMWI->getModule()->getDataLayout(); + return DL.getTypeStoreSize(RMWI->getValOperand()->getType()); +} + +unsigned getAtomicOpAlign(AtomicCmpXchgInst *CASI) { + // TODO: same comment as above. + const DataLayout &DL = CASI->getModule()->getDataLayout(); + return DL.getTypeStoreSize(CASI->getCompareOperand()->getType()); +} + +// Determine if a particular atomic operation has a supported size, +// and is of appropriate alignment, to be passed through for target +// lowering. 
(Versus turning into a __atomic libcall) +template +bool atomicSizeSupported(const TargetLowering *TLI, Inst *I) { + unsigned Size = getAtomicOpSize(I); + unsigned Align = getAtomicOpAlign(I); + return Align >= Size && Size <= TLI->getMaxAtomicSizeSupported() / 8; +} + +} // end anonymous namespace + bool AtomicExpand::runOnFunction(Function &F) { if (!TM || !TM->getSubtargetImpl(F)->enableAtomicExpand()) return false; @@ -93,14 +167,43 @@ bool MadeChange = false; for (auto I : AtomicInsts) { + if (isa(I)) + continue; + auto LI = dyn_cast(I); auto SI = dyn_cast(I); auto RMWI = dyn_cast(I); auto CASI = dyn_cast(I); - assert((LI || SI || RMWI || CASI || isa(I)) && - "Unknown atomic instruction"); + assert((LI || SI || RMWI || CASI) && "Unknown atomic instruction"); - if (TLI->getInsertFencesForAtomic()) { + // If the Size/Alignment is not supported, replace with a libcall. + if (LI) { + if (!atomicSizeSupported(TLI, LI)) { + expandAtomicLoadToLibcall(LI); + MadeChange = true; + continue; + } + } else if (SI) { + if (!atomicSizeSupported(TLI, SI)) { + expandAtomicStoreToLibcall(SI); + MadeChange = true; + continue; + } + } else if (RMWI) { + if (!atomicSizeSupported(TLI, RMWI)) { + expandAtomicRMWToLibcall(RMWI); + MadeChange = true; + continue; + } + } else if (CASI) { + if (!atomicSizeSupported(TLI, CASI)) { + expandAtomicCASToLibcall(CASI); + MadeChange = true; + continue; + } + } + + if (TLI->shouldInsertFencesForAtomic(I)) { auto FenceOrdering = Monotonic; bool IsStore, IsLoad; if (LI && isAtLeastAcquire(LI->getOrdering())) { @@ -144,7 +247,7 @@ assert(LI->getType()->isIntegerTy() && "invariant broken"); MadeChange = true; } - + MadeChange |= tryExpandAtomicLoad(LI); } else if (SI) { if (SI->getValueOperand()->getType()->isFloatingPointTy()) { @@ -514,12 +617,14 @@ BasicBlock *BB = CI->getParent(); Function *F = BB->getParent(); LLVMContext &Ctx = F->getContext(); - // If getInsertFencesForAtomic() returns true, then the target does not want + // If shouldInsertFencesForAtomic() returns true, then the target does not + // want // to deal with memory orders, and emitLeading/TrailingFence should take care // of everything. Otherwise, emitLeading/TrailingFence are no-op and we // should preserve the ordering. + bool ShouldInsertFencesForAtomic = TLI->shouldInsertFencesForAtomic(CI); AtomicOrdering MemOpOrder = - TLI->getInsertFencesForAtomic() ? Monotonic : SuccessOrder; + ShouldInsertFencesForAtomic ? Monotonic : SuccessOrder; // In implementations which use a barrier to achieve release semantics, we can // delay emitting this barrier until we know a store is actually going to be @@ -530,7 +635,7 @@ // since in other cases the extra blocks naturally collapse down to the // minimal loop. Unfortunately, this puts too much stress on later // optimisations so we avoid emitting the extra logic in those cases too. - bool HasReleasedLoadBB = !CI->isWeak() && TLI->getInsertFencesForAtomic() && + bool HasReleasedLoadBB = !CI->isWeak() && ShouldInsertFencesForAtomic && SuccessOrder != Monotonic && SuccessOrder != Acquire && !F->optForMinSize(); @@ -601,7 +706,7 @@ // the branch entirely. 
std::prev(BB->end())->eraseFromParent(); Builder.SetInsertPoint(BB); - if (UseUnconditionalReleaseBarrier) + if (ShouldInsertFencesForAtomic && UseUnconditionalReleaseBarrier) TLI->emitLeadingFence(Builder, SuccessOrder, /*IsStore=*/true, /*IsLoad=*/true); Builder.CreateBr(StartBB); @@ -617,7 +722,7 @@ Builder.CreateCondBr(ShouldStore, ReleasingStoreBB, NoStoreBB); Builder.SetInsertPoint(ReleasingStoreBB); - if (!UseUnconditionalReleaseBarrier) + if (ShouldInsertFencesForAtomic && !UseUnconditionalReleaseBarrier) TLI->emitLeadingFence(Builder, SuccessOrder, /*IsStore=*/true, /*IsLoad=*/true); Builder.CreateBr(TryStoreBB); @@ -647,8 +752,9 @@ // Make sure later instructions don't get reordered with a fence if // necessary. Builder.SetInsertPoint(SuccessBB); - TLI->emitTrailingFence(Builder, SuccessOrder, /*IsStore=*/true, - /*IsLoad=*/true); + if (ShouldInsertFencesForAtomic) + TLI->emitTrailingFence(Builder, SuccessOrder, /*IsStore=*/true, + /*IsLoad=*/true); Builder.CreateBr(ExitBB); Builder.SetInsertPoint(NoStoreBB); @@ -659,8 +765,9 @@ Builder.CreateBr(FailureBB); Builder.SetInsertPoint(FailureBB); - TLI->emitTrailingFence(Builder, FailureOrder, /*IsStore=*/true, - /*IsLoad=*/true); + if (ShouldInsertFencesForAtomic) + TLI->emitTrailingFence(Builder, FailureOrder, /*IsStore=*/true, + /*IsLoad=*/true); Builder.CreateBr(ExitBB); // Finally, we have control-flow based knowledge of whether the cmpxchg @@ -828,3 +935,385 @@ return true; } + +// This converts from LLVM's internal AtomicOrdering enum to the +// memory_order_* value required by the __atomic_* libcalls. +static int libcallAtomicModel(AtomicOrdering AO) { + switch (AO) { + case NotAtomic: + llvm_unreachable("Expected atomic memory order."); + case Unordered: + case Monotonic: + return 0; // memory_order_relaxed + // Not implemented yet in llvm: + // case Consume: + // return 1; // memory_order_consume + case Acquire: + return 2; // memory_order_acquire + case Release: + return 3; // memory_order_release + case AcquireRelease: + return 4; // memory_order_acq_rel + case SequentiallyConsistent: + return 5; // memory_order_seq_cst + } + llvm_unreachable("Unknown atomic memory order."); +} + +// In order to use one of the sized library calls such as +// __atomic_fetch_add_4, the alignment must be sufficient, the size +// must be one of the potentially-specialized sizes, and the value +// type must actually exist in C on the target (otherwise, the +// function wouldn't actually be defined.) +static bool canUseSizedAtomicCall(unsigned Size, unsigned Align, + const DataLayout &DL) { + // TODO: "LargestSize" is an approximation for "largest type that + // you can express in C". It seems to be the case that int128 is + // supported on all 64-bit platforms, otherwise only up to 64-bit + // integers are supported. If we get this wrong, then we'll try to + // call a sized libcall that doesn't actually exist. There should + // really be some more reliable way in LLVM of determining integer + // sizes which are valid in the target's C ABI... + unsigned LargestSize = DL.getLargestLegalIntTypeSize() >= 64 ? 
16 : 8; + return Align >= Size && + (Size == 1 || Size == 2 || Size == 4 || Size == 8 || Size == 16) && + Size <= LargestSize; +} + +void AtomicExpand::expandAtomicLoadToLibcall(LoadInst *I) { + static const RTLIB::Libcall Libcalls[6] = { + RTLIB::ATOMIC_LOAD, RTLIB::ATOMIC_LOAD_1, RTLIB::ATOMIC_LOAD_2, + RTLIB::ATOMIC_LOAD_4, RTLIB::ATOMIC_LOAD_8, RTLIB::ATOMIC_LOAD_16}; + unsigned Size = getAtomicOpSize(I); + unsigned Align = getAtomicOpAlign(I); + + if (!expandAtomicOpToLibcall(I, Size, Align, I->getPointerOperand(), nullptr, + nullptr, I->getOrdering(), + AtomicOrdering::NotAtomic, Libcalls)) + llvm_unreachable("expandAtomicOpToLibcall shouldn't fail tor Load"); +} + +void AtomicExpand::expandAtomicStoreToLibcall(StoreInst *I) { + static const RTLIB::Libcall Libcalls[6] = { + RTLIB::ATOMIC_STORE, RTLIB::ATOMIC_STORE_1, RTLIB::ATOMIC_STORE_2, + RTLIB::ATOMIC_STORE_4, RTLIB::ATOMIC_STORE_8, RTLIB::ATOMIC_STORE_16}; + unsigned Size = getAtomicOpSize(I); + unsigned Align = getAtomicOpAlign(I); + + if (!expandAtomicOpToLibcall(I, Size, Align, I->getPointerOperand(), + I->getValueOperand(), nullptr, I->getOrdering(), + AtomicOrdering::NotAtomic, Libcalls)) + llvm_unreachable("expandAtomicOpToLibcall shouldn't fail tor Store"); +} + +void AtomicExpand::expandAtomicCASToLibcall(AtomicCmpXchgInst *I) { + static const RTLIB::Libcall Libcalls[6] = { + RTLIB::ATOMIC_COMPARE_EXCHANGE, RTLIB::ATOMIC_COMPARE_EXCHANGE_1, + RTLIB::ATOMIC_COMPARE_EXCHANGE_2, RTLIB::ATOMIC_COMPARE_EXCHANGE_4, + RTLIB::ATOMIC_COMPARE_EXCHANGE_8, RTLIB::ATOMIC_COMPARE_EXCHANGE_16}; + unsigned Size = getAtomicOpSize(I); + unsigned Align = getAtomicOpAlign(I); + + if (!expandAtomicOpToLibcall(I, Size, Align, I->getPointerOperand(), + I->getNewValOperand(), I->getCompareOperand(), + I->getSuccessOrdering(), I->getFailureOrdering(), + Libcalls)) + llvm_unreachable("expandAtomicOpToLibcall shouldn't fail tor CAS"); +} + +void AtomicExpand::expandAtomicRMWToLibcall(AtomicRMWInst *I) { + static const RTLIB::Libcall LibcallsXchg[6] = { + RTLIB::ATOMIC_EXCHANGE, RTLIB::ATOMIC_EXCHANGE_1, + RTLIB::ATOMIC_EXCHANGE_2, RTLIB::ATOMIC_EXCHANGE_4, + RTLIB::ATOMIC_EXCHANGE_8, RTLIB::ATOMIC_EXCHANGE_16}; + static const RTLIB::Libcall LibcallsAdd[6] = { + RTLIB::UNKNOWN_LIBCALL, RTLIB::ATOMIC_FETCH_ADD_1, + RTLIB::ATOMIC_FETCH_ADD_2, RTLIB::ATOMIC_FETCH_ADD_4, + RTLIB::ATOMIC_FETCH_ADD_8, RTLIB::ATOMIC_FETCH_ADD_16}; + static const RTLIB::Libcall LibcallsSub[6] = { + RTLIB::UNKNOWN_LIBCALL, RTLIB::ATOMIC_FETCH_SUB_1, + RTLIB::ATOMIC_FETCH_SUB_2, RTLIB::ATOMIC_FETCH_SUB_4, + RTLIB::ATOMIC_FETCH_SUB_8, RTLIB::ATOMIC_FETCH_SUB_16}; + static const RTLIB::Libcall LibcallsAnd[6] = { + RTLIB::UNKNOWN_LIBCALL, RTLIB::ATOMIC_FETCH_AND_1, + RTLIB::ATOMIC_FETCH_AND_2, RTLIB::ATOMIC_FETCH_AND_4, + RTLIB::ATOMIC_FETCH_AND_8, RTLIB::ATOMIC_FETCH_AND_16}; + static const RTLIB::Libcall LibcallsOr[6] = { + RTLIB::UNKNOWN_LIBCALL, RTLIB::ATOMIC_FETCH_OR_1, + RTLIB::ATOMIC_FETCH_OR_2, RTLIB::ATOMIC_FETCH_OR_4, + RTLIB::ATOMIC_FETCH_OR_8, RTLIB::ATOMIC_FETCH_OR_16}; + static const RTLIB::Libcall LibcallsXor[6] = { + RTLIB::UNKNOWN_LIBCALL, RTLIB::ATOMIC_FETCH_XOR_1, + RTLIB::ATOMIC_FETCH_XOR_2, RTLIB::ATOMIC_FETCH_XOR_4, + RTLIB::ATOMIC_FETCH_XOR_8, RTLIB::ATOMIC_FETCH_XOR_16}; + static const RTLIB::Libcall LibcallsNand[6] = { + RTLIB::UNKNOWN_LIBCALL, RTLIB::ATOMIC_FETCH_NAND_1, + RTLIB::ATOMIC_FETCH_NAND_2, RTLIB::ATOMIC_FETCH_NAND_4, + RTLIB::ATOMIC_FETCH_NAND_8, RTLIB::ATOMIC_FETCH_NAND_16}; + + const RTLIB::Libcall *Libcalls; + switch (I->getOperation()) { 
+ case AtomicRMWInst::Xchg: + Libcalls = LibcallsXchg; + break; + case AtomicRMWInst::Add: + Libcalls = LibcallsAdd; + break; + case AtomicRMWInst::Sub: + Libcalls = LibcallsSub; + break; + case AtomicRMWInst::And: + Libcalls = LibcallsAnd; + break; + case AtomicRMWInst::Or: + Libcalls = LibcallsOr; + break; + case AtomicRMWInst::Xor: + Libcalls = LibcallsXor; + break; + case AtomicRMWInst::Nand: + Libcalls = LibcallsNand; + break; + case AtomicRMWInst::Max: + case AtomicRMWInst::Min: + case AtomicRMWInst::UMax: + case AtomicRMWInst::UMin: + // No atomic libcalls are available for max/min/umax/umin. + Libcalls = nullptr; + break; + default: + llvm_unreachable("Unexpected RMW operation"); + } + + unsigned Size = getAtomicOpSize(I); + unsigned Align = getAtomicOpAlign(I); + + bool Success = Libcalls && expandAtomicOpToLibcall( + I, Size, Align, I->getPointerOperand(), + I->getValOperand(), nullptr, I->getOrdering(), + AtomicOrdering::NotAtomic, Libcalls); + + // The expansion failed: either there were no libcalls at all for + // the operation (min/max), or there were only size-specialized + // libcalls (add/sub/etc) and we needed a generic. So, expand to a + // CAS loop instead. + if (!Success) { + expandAtomicRMWToCmpXchg(I, [this](IRBuilder<> &Builder, Value *Addr, + Value *Loaded, Value *NewVal, + AtomicOrdering MemOpOrder, + Value *&Success, Value *&NewLoaded) { + // Create the CAS instruction normally... + AtomicCmpXchgInst *Pair = Builder.CreateAtomicCmpXchg( + Addr, Loaded, NewVal, MemOpOrder, + AtomicCmpXchgInst::getStrongestFailureOrdering(MemOpOrder)); + Success = Builder.CreateExtractValue(Pair, 1, "success"); + NewLoaded = Builder.CreateExtractValue(Pair, 0, "newloaded"); + + // ...and then expand the CAS into a libcall. + expandAtomicCASToLibcall(Pair); + }); + } +} + +// A helper routine for the above expandAtomic*ToLibcall functions. +// +// 'Libcalls' contains an array of enum values for the particular +// ATOMIC libcalls to be emitted. All of the other arguments besides +// 'I' are extracted from the Instruction subclass by the +// caller. Depending on the particular call, some will be null. +bool AtomicExpand::expandAtomicOpToLibcall( + Instruction *I, unsigned Size, unsigned Align, Value *PointerOperand, + Value *ValueOperand, Value *CASExpected, AtomicOrdering Ordering, + AtomicOrdering Ordering2, const RTLIB::Libcall *Libcalls) { + LLVMContext &Ctx = I->getContext(); + Module *M = I->getModule(); + const DataLayout &DL = M->getDataLayout(); + IRBuilder<> Builder(I); + IRBuilder<> AllocaBuilder(&I->getFunction()->getEntryBlock().front()); + + unsigned AllocaAlignment = std::min(Size, 16u); + bool UseSizedLibcall = canUseSizedAtomicCall(Size, Align, DL); + + Type *SizedIntTy = Type::getIntNTy(Ctx, Size * 8); + + // TODO: the "order" argument type is "int", not int32. So + // getInt32Ty may be wrong if the arch uses e.g. 16-bit ints. + ConstantInt *SizeVal64 = ConstantInt::get(Type::getInt64Ty(Ctx), Size); + Constant *OrderingVal = + ConstantInt::get(Type::getInt32Ty(Ctx), libcallAtomicModel(Ordering)); + Constant *Ordering2Val = CASExpected + ? 
ConstantInt::get(Type::getInt32Ty(Ctx), + libcallAtomicModel(Ordering2)) + : nullptr; + bool HasResult = I->getType() != Type::getVoidTy(Ctx); + + RTLIB::Libcall RTLibType; + if (UseSizedLibcall) { + switch (Size) { + case 1: + RTLibType = Libcalls[1]; + break; + case 2: + RTLibType = Libcalls[2]; + break; + case 4: + RTLibType = Libcalls[3]; + break; + case 8: + RTLibType = Libcalls[4]; + break; + case 16: + RTLibType = Libcalls[5]; + break; + } + } else if (Libcalls[0] != RTLIB::UNKNOWN_LIBCALL) { + RTLibType = Libcalls[0]; + } else { + // Can't use sized function, and there's no generic for this + // operation, so give up. + return false; + } + + // Build up the function call. There's two kinds. First, the sized + // variants. These calls are going to be one of the following (with + // N=1,2,4,8,16): + // iN __atomic_load_N(iN *ptr, int ordering) + // void __atomic_store_N(iN *ptr, iN val, int ordering) + // iN __atomic_{exchange|fetch_*}_N(iN *ptr, iN val, int ordering) + // bool __atomic_compare_exchange_N(iN *ptr, iN *expected, iN desired, + // int success_order, int failure_order) + // + // Note that these functions can be used for non-integer atomic + // operations, the values just need to be bitcast to integers on the + // way in and out. + // + // And, then, the generic variants. They look like the following: + // void __atomic_load(size_t size, void *ptr, void *ret, int ordering) + // void __atomic_store(size_t size, void *ptr, void *val, int ordering) + // void __atomic_exchange(size_t size, void *ptr, void *val, void *ret, + // int ordering) + // bool __atomic_compare_exchange(size_t size, void *ptr, void *expected, + // void *desired, int success_order, + // int failure_order) + // + // The different signatures are built up depending on the + // 'UseSizedLibcall', 'CASExpected', 'ValueOperand', and 'HasResult' + // variables. + + AllocaInst *AllocaCASExpected = nullptr; + Value *AllocaCASExpected_i8 = nullptr; + AllocaInst *AllocaValue = nullptr; + Value *AllocaValue_i8 = nullptr; + AllocaInst *AllocaResult = nullptr; + Value *AllocaResult_i8 = nullptr; + + Type *ResultTy; + SmallVector Args; + AttributeSet Attr; + + // 'size' argument. + if (!UseSizedLibcall) { + // Note, getIntPtrType is assumed equivalent to size_t. + Args.push_back(ConstantInt::get(DL.getIntPtrType(Ctx), Size)); + } + + // 'ptr' argument. + Value *PtrVal = + Builder.CreateBitCast(PointerOperand, Type::getInt8PtrTy(Ctx)); + Args.push_back(PtrVal); + + // 'expected' argument, if present. + if (CASExpected) { + AllocaCASExpected = + AllocaBuilder.CreateAlloca(CASExpected->getType(), nullptr, ""); + AllocaCASExpected->setAlignment(AllocaAlignment); + AllocaCASExpected_i8 = + Builder.CreateBitCast(AllocaCASExpected, Type::getInt8PtrTy(Ctx)); + Builder.CreateLifetimeStart(AllocaCASExpected_i8, SizeVal64); + Builder.CreateAlignedStore(CASExpected, AllocaCASExpected, AllocaAlignment); + Args.push_back(AllocaCASExpected_i8); + } + + // 'val' argument ('desired' for cas), if present. 
+ if (ValueOperand) { + if (UseSizedLibcall) { + Value *IntValue = + Builder.CreateBitOrPointerCast(ValueOperand, SizedIntTy); + Args.push_back(IntValue); + } else { + AllocaValue = + AllocaBuilder.CreateAlloca(ValueOperand->getType(), nullptr, ""); + AllocaValue->setAlignment(AllocaAlignment); + AllocaValue_i8 = + Builder.CreateBitCast(AllocaValue, Type::getInt8PtrTy(Ctx)); + Builder.CreateLifetimeStart(AllocaValue_i8, SizeVal64); + Builder.CreateAlignedStore(ValueOperand, AllocaValue, AllocaAlignment); + Args.push_back(AllocaValue_i8); + } + } + + // 'ret' argument. + if (!CASExpected && HasResult && !UseSizedLibcall) { + AllocaResult = AllocaBuilder.CreateAlloca(I->getType(), nullptr, ""); + AllocaResult->setAlignment(AllocaAlignment); + AllocaResult_i8 = + Builder.CreateBitCast(AllocaResult, Type::getInt8PtrTy(Ctx)); + Builder.CreateLifetimeStart(AllocaResult_i8, SizeVal64); + Args.push_back(AllocaResult_i8); + } + + // 'ordering' ('success_order' for cas) argument. + Args.push_back(OrderingVal); + + // 'failure_order' argument, if present. + if (Ordering2Val) + Args.push_back(Ordering2Val); + + // Now, the return type. + if (CASExpected) { + ResultTy = Type::getInt1Ty(Ctx); + Attr = Attr.addAttribute(Ctx, AttributeSet::ReturnIndex, Attribute::ZExt); + } else if (HasResult && UseSizedLibcall) + ResultTy = SizedIntTy; + else + ResultTy = Type::getVoidTy(Ctx); + + // Done with setting up arguments and return types, create the call: + SmallVector ArgTys; + for (Value *Arg : Args) + ArgTys.push_back(Arg->getType()); + FunctionType *FnType = FunctionType::get(ResultTy, ArgTys, false); + Constant *LibcallFn = + M->getOrInsertFunction(TLI->getLibcallName(RTLibType), FnType, Attr); + CallInst *Call = Builder.CreateCall(LibcallFn, Args); + Call->setAttributes(Attr); + Value *Result = Call; + + // And then, extract the results... 
+ if (ValueOperand && !UseSizedLibcall) + Builder.CreateLifetimeEnd(AllocaValue_i8, SizeVal64); + + if (CASExpected) { + // The final result from the CAS is {load of 'expected' alloca, bool result + // from call} + Type *FinalResultTy = I->getType(); + Value *V = UndefValue::get(FinalResultTy); + Value *ExpectedOut = + Builder.CreateAlignedLoad(AllocaCASExpected, AllocaAlignment); + Builder.CreateLifetimeEnd(AllocaCASExpected_i8, SizeVal64); + V = Builder.CreateInsertValue(V, ExpectedOut, 0, ""); + V = Builder.CreateInsertValue(V, Result, 1, ""); + I->replaceAllUsesWith(V); + } else if (HasResult) { + Value *V; + if (UseSizedLibcall) + V = Builder.CreateBitOrPointerCast(Result, I->getType()); + else { + V = Builder.CreateAlignedLoad(AllocaResult, AllocaAlignment); + Builder.CreateLifetimeEnd(AllocaResult_i8, SizeVal64); + } + I->replaceAllUsesWith(V); + } + I->eraseFromParent(); + return true; +} Index: lib/CodeGen/SelectionDAG/LegalizeDAG.cpp =================================================================== --- lib/CodeGen/SelectionDAG/LegalizeDAG.cpp +++ lib/CodeGen/SelectionDAG/LegalizeDAG.cpp @@ -4035,7 +4035,7 @@ case ISD::ATOMIC_LOAD_UMAX: case ISD::ATOMIC_CMP_SWAP: { MVT VT = cast(Node)->getMemoryVT().getSimpleVT(); - RTLIB::Libcall LC = RTLIB::getATOMIC(Opc, VT); + RTLIB::Libcall LC = RTLIB::getSYNC(Opc, VT); assert(LC != RTLIB::UNKNOWN_LIBCALL && "Unexpected atomic op or value type!"); std::pair Tmp = ExpandChainLibCall(LC, Node, false); Index: lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp =================================================================== --- lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp +++ lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp @@ -1404,7 +1404,7 @@ std::pair DAGTypeLegalizer::ExpandAtomic(SDNode *Node) { unsigned Opc = Node->getOpcode(); MVT VT = cast(Node)->getMemoryVT().getSimpleVT(); - RTLIB::Libcall LC = RTLIB::getATOMIC(Opc, VT); + RTLIB::Libcall LC = RTLIB::getSYNC(Opc, VT); assert(LC != RTLIB::UNKNOWN_LIBCALL && "Unexpected atomic op or value type!"); return ExpandChainLibCall(LC, Node, false); Index: lib/CodeGen/TargetLoweringBase.cpp =================================================================== --- lib/CodeGen/TargetLoweringBase.cpp +++ lib/CodeGen/TargetLoweringBase.cpp @@ -405,7 +405,66 @@ Names[RTLIB::SYNC_FETCH_AND_UMIN_4] = "__sync_fetch_and_umin_4"; Names[RTLIB::SYNC_FETCH_AND_UMIN_8] = "__sync_fetch_and_umin_8"; Names[RTLIB::SYNC_FETCH_AND_UMIN_16] = "__sync_fetch_and_umin_16"; - + + Names[RTLIB::ATOMIC_LOAD] = "__atomic_load"; + Names[RTLIB::ATOMIC_LOAD_1] = "__atomic_load_1"; + Names[RTLIB::ATOMIC_LOAD_2] = "__atomic_load_2"; + Names[RTLIB::ATOMIC_LOAD_4] = "__atomic_load_4"; + Names[RTLIB::ATOMIC_LOAD_8] = "__atomic_load_8"; + Names[RTLIB::ATOMIC_LOAD_16] = "__atomic_load_16"; + + Names[RTLIB::ATOMIC_STORE] = "__atomic_store"; + Names[RTLIB::ATOMIC_STORE_1] = "__atomic_store_1"; + Names[RTLIB::ATOMIC_STORE_2] = "__atomic_store_2"; + Names[RTLIB::ATOMIC_STORE_4] = "__atomic_store_4"; + Names[RTLIB::ATOMIC_STORE_8] = "__atomic_store_8"; + Names[RTLIB::ATOMIC_STORE_16] = "__atomic_store_16"; + + Names[RTLIB::ATOMIC_EXCHANGE] = "__atomic_exchange"; + Names[RTLIB::ATOMIC_EXCHANGE_1] = "__atomic_exchange_1"; + Names[RTLIB::ATOMIC_EXCHANGE_2] = "__atomic_exchange_2"; + Names[RTLIB::ATOMIC_EXCHANGE_4] = "__atomic_exchange_4"; + Names[RTLIB::ATOMIC_EXCHANGE_8] = "__atomic_exchange_8"; + Names[RTLIB::ATOMIC_EXCHANGE_16] = "__atomic_exchange_16"; + + Names[RTLIB::ATOMIC_COMPARE_EXCHANGE] = "__atomic_compare_exchange"; + 
Names[RTLIB::ATOMIC_COMPARE_EXCHANGE_1] = "__atomic_compare_exchange_1"; + Names[RTLIB::ATOMIC_COMPARE_EXCHANGE_2] = "__atomic_compare_exchange_2"; + Names[RTLIB::ATOMIC_COMPARE_EXCHANGE_4] = "__atomic_compare_exchange_4"; + Names[RTLIB::ATOMIC_COMPARE_EXCHANGE_8] = "__atomic_compare_exchange_8"; + Names[RTLIB::ATOMIC_COMPARE_EXCHANGE_16] = "__atomic_compare_exchange_16"; + + Names[RTLIB::ATOMIC_FETCH_ADD_1] = "__atomic_fetch_add_1"; + Names[RTLIB::ATOMIC_FETCH_ADD_2] = "__atomic_fetch_add_2"; + Names[RTLIB::ATOMIC_FETCH_ADD_4] = "__atomic_fetch_add_4"; + Names[RTLIB::ATOMIC_FETCH_ADD_8] = "__atomic_fetch_add_8"; + Names[RTLIB::ATOMIC_FETCH_ADD_16] = "__atomic_fetch_add_16"; + Names[RTLIB::ATOMIC_FETCH_SUB_1] = "__atomic_fetch_sub_1"; + Names[RTLIB::ATOMIC_FETCH_SUB_2] = "__atomic_fetch_sub_2"; + Names[RTLIB::ATOMIC_FETCH_SUB_4] = "__atomic_fetch_sub_4"; + Names[RTLIB::ATOMIC_FETCH_SUB_8] = "__atomic_fetch_sub_8"; + Names[RTLIB::ATOMIC_FETCH_SUB_16] = "__atomic_fetch_sub_16"; + Names[RTLIB::ATOMIC_FETCH_AND_1] = "__atomic_fetch_and_1"; + Names[RTLIB::ATOMIC_FETCH_AND_2] = "__atomic_fetch_and_2"; + Names[RTLIB::ATOMIC_FETCH_AND_4] = "__atomic_fetch_and_4"; + Names[RTLIB::ATOMIC_FETCH_AND_8] = "__atomic_fetch_and_8"; + Names[RTLIB::ATOMIC_FETCH_AND_16] = "__atomic_fetch_and_16"; + Names[RTLIB::ATOMIC_FETCH_OR_1] = "__atomic_fetch_or_1"; + Names[RTLIB::ATOMIC_FETCH_OR_2] = "__atomic_fetch_or_2"; + Names[RTLIB::ATOMIC_FETCH_OR_4] = "__atomic_fetch_or_4"; + Names[RTLIB::ATOMIC_FETCH_OR_8] = "__atomic_fetch_or_8"; + Names[RTLIB::ATOMIC_FETCH_OR_16] = "__atomic_fetch_or_16"; + Names[RTLIB::ATOMIC_FETCH_XOR_1] = "__atomic_fetch_xor_1"; + Names[RTLIB::ATOMIC_FETCH_XOR_2] = "__atomic_fetch_xor_2"; + Names[RTLIB::ATOMIC_FETCH_XOR_4] = "__atomic_fetch_xor_4"; + Names[RTLIB::ATOMIC_FETCH_XOR_8] = "__atomic_fetch_xor_8"; + Names[RTLIB::ATOMIC_FETCH_XOR_16] = "__atomic_fetch_xor_16"; + Names[RTLIB::ATOMIC_FETCH_NAND_1] = "__atomic_fetch_nand_1"; + Names[RTLIB::ATOMIC_FETCH_NAND_2] = "__atomic_fetch_nand_2"; + Names[RTLIB::ATOMIC_FETCH_NAND_4] = "__atomic_fetch_nand_4"; + Names[RTLIB::ATOMIC_FETCH_NAND_8] = "__atomic_fetch_nand_8"; + Names[RTLIB::ATOMIC_FETCH_NAND_16] = "__atomic_fetch_nand_16"; + if (TT.getEnvironment() == Triple::GNU) { Names[RTLIB::SINCOS_F32] = "sincosf"; Names[RTLIB::SINCOS_F64] = "sincos"; @@ -667,7 +726,7 @@ return UNKNOWN_LIBCALL; } -RTLIB::Libcall RTLIB::getATOMIC(unsigned Opc, MVT VT) { +RTLIB::Libcall RTLIB::getSYNC(unsigned Opc, MVT VT) { #define OP_TO_LIBCALL(Name, Enum) \ case Name: \ switch (VT.SimpleTy) { \ @@ -774,8 +833,10 @@ PrefLoopAlignment = 0; GatherAllAliasesMaxDepth = 6; MinStackArgumentAlignment = 1; - InsertFencesForAtomic = false; MinimumJumpTableEntries = 4; + // TODO: the default will be switched to 0 in the next commit, along + // with the Target-specific changes necessary. 
+ MaxAtomicSizeSupported = 1024; InitLibcallNames(LibcallRoutineNames, TM.getTargetTriple()); InitCmpLibcallCCs(CmpLibcallCCs); Index: lib/Target/ARM/ARMISelLowering.h =================================================================== --- lib/Target/ARM/ARMISelLowering.h +++ lib/Target/ARM/ARMISelLowering.h @@ -453,6 +453,7 @@ bool lowerInterleavedStore(StoreInst *SI, ShuffleVectorInst *SVI, unsigned Factor) const override; + bool shouldInsertFencesForAtomic(const Instruction *I) const override; TargetLoweringBase::AtomicExpansionKind shouldExpandAtomicLoadInIR(LoadInst *LI) const override; bool shouldExpandAtomicStoreInIR(StoreInst *SI) const override; @@ -486,6 +487,10 @@ /// unsigned ARMPCLabelIndex; + // TODO: remove this, and have shouldInsertFencesForAtomic do the proper + // check. + bool InsertFencesForAtomic; + void addTypeForNEON(MVT VT, MVT PromotedLdStVT, MVT PromotedBitwiseVT); void addDRTypeForNEON(MVT VT); void addQRTypeForNEON(MVT VT); Index: lib/Target/ARM/ARMISelLowering.cpp =================================================================== --- lib/Target/ARM/ARMISelLowering.cpp +++ lib/Target/ARM/ARMISelLowering.cpp @@ -840,6 +840,7 @@ // the default expansion. If we are targeting a single threaded system, // then set them all for expand so we can lower them later into their // non-atomic form. + InsertFencesForAtomic = false; if (TM.Options.ThreadModel == ThreadModel::Single) setOperationAction(ISD::ATOMIC_FENCE, MVT::Other, Expand); else if (Subtarget->hasAnyDataBarrier() && (!Subtarget->isThumb() || @@ -852,7 +853,7 @@ // if they can be combined with nearby atomic loads and stores. if (!Subtarget->hasV8Ops()) { // Automatically insert fences (dmb ish) around ATOMIC_SWAP etc. - setInsertFencesForAtomic(true); + InsertFencesForAtomic = true; } } else { // If there's anything we can use as a barrier, go through custom lowering @@ -11997,9 +11998,6 @@ Instruction* ARMTargetLowering::emitLeadingFence(IRBuilder<> &Builder, AtomicOrdering Ord, bool IsStore, bool IsLoad) const { - if (!getInsertFencesForAtomic()) - return nullptr; - switch (Ord) { case NotAtomic: case Unordered: @@ -12025,9 +12023,6 @@ Instruction* ARMTargetLowering::emitTrailingFence(IRBuilder<> &Builder, AtomicOrdering Ord, bool IsStore, bool IsLoad) const { - if (!getInsertFencesForAtomic()) - return nullptr; - switch (Ord) { case NotAtomic: case Unordered: @@ -12081,6 +12076,11 @@ return true; } +bool ARMTargetLowering::shouldInsertFencesForAtomic( + const Instruction *I) const { + return InsertFencesForAtomic; +} + // This has so far only been implemented for MachO. 
bool ARMTargetLowering::useLoadStackGuardNode() const { return Subtarget->isTargetMachO(); Index: lib/Target/Hexagon/HexagonISelLowering.cpp =================================================================== --- lib/Target/Hexagon/HexagonISelLowering.cpp +++ lib/Target/Hexagon/HexagonISelLowering.cpp @@ -1724,7 +1724,6 @@ setPrefLoopAlignment(4); setPrefFunctionAlignment(4); setMinFunctionAlignment(2); - setInsertFencesForAtomic(false); setStackPointerRegisterToSaveRestore(HRI.getStackRegister()); if (EnableHexSDNodeSched) Index: lib/Target/Mips/MipsISelLowering.h =================================================================== --- lib/Target/Mips/MipsISelLowering.h +++ lib/Target/Mips/MipsISelLowering.h @@ -561,6 +561,10 @@ unsigned getJumpTableEncoding() const override; bool useSoftFloat() const override; + bool shouldInsertFencesForAtomic(const Instruction *I) const override { + return true; + } + /// Emit a sign-extension using sll/sra, seb, or seh appropriately. MachineBasicBlock *emitSignExtendToI32InReg(MachineInstr *MI, MachineBasicBlock *BB, Index: lib/Target/Mips/MipsISelLowering.cpp =================================================================== --- lib/Target/Mips/MipsISelLowering.cpp +++ lib/Target/Mips/MipsISelLowering.cpp @@ -396,7 +396,6 @@ setOperationAction(ISD::ATOMIC_STORE, MVT::i64, Expand); } - setInsertFencesForAtomic(true); if (!Subtarget.hasMips32r2()) { setOperationAction(ISD::SIGN_EXTEND_INREG, MVT::i8, Expand); Index: lib/Target/PowerPC/PPCISelLowering.h =================================================================== --- lib/Target/PowerPC/PPCISelLowering.h +++ lib/Target/PowerPC/PPCISelLowering.h @@ -508,6 +508,10 @@ unsigned getPrefLoopAlignment(MachineLoop *ML) const override; + bool shouldInsertFencesForAtomic(const Instruction *I) const override { + return true; + } + Instruction* emitLeadingFence(IRBuilder<> &Builder, AtomicOrdering Ord, bool IsStore, bool IsLoad) const override; Instruction* emitTrailingFence(IRBuilder<> &Builder, AtomicOrdering Ord, Index: lib/Target/PowerPC/PPCISelLowering.cpp =================================================================== --- lib/Target/PowerPC/PPCISelLowering.cpp +++ lib/Target/PowerPC/PPCISelLowering.cpp @@ -916,7 +916,6 @@ break; } - setInsertFencesForAtomic(true); if (Subtarget.enableMachineScheduler()) setSchedulingPreference(Sched::Source); Index: lib/Target/Sparc/SparcISelLowering.h =================================================================== --- lib/Target/Sparc/SparcISelLowering.h +++ lib/Target/Sparc/SparcISelLowering.h @@ -180,6 +180,13 @@ return VT != MVT::f128; } + bool shouldInsertFencesForAtomic(const Instruction *I) const override { + // FIXME: We insert fences for each atomics and generate + // sub-optimal code for PSO/TSO. (Approximately nobody uses any + // mode but TSO, which makes this even more silly) + return true; + } + void ReplaceNodeResults(SDNode *N, SmallVectorImpl& Results, SelectionDAG &DAG) const override; Index: lib/Target/Sparc/SparcISelLowering.cpp =================================================================== --- lib/Target/Sparc/SparcISelLowering.cpp +++ lib/Target/Sparc/SparcISelLowering.cpp @@ -1603,10 +1603,13 @@ } // ATOMICs. - // FIXME: We insert fences for each atomics and generate sub-optimal code - // for PSO/TSO. Also, implement other atomicrmw operations. - - setInsertFencesForAtomic(true); + // Atomics are only supported on Sparcv9. (32bit atomics are also + // supported by the Leon sparcv8 variant, but we don't support that + // yet.) 
+ if (Subtarget->isV9()) + setMaxAtomicSizeSupported(64); + else + setMaxAtomicSizeSupported(0); setOperationAction(ISD::ATOMIC_SWAP, MVT::i32, Legal); setOperationAction(ISD::ATOMIC_CMP_SWAP, MVT::i32, Index: lib/Target/XCore/XCoreISelLowering.h =================================================================== --- lib/Target/XCore/XCoreISelLowering.h +++ lib/Target/XCore/XCoreISelLowering.h @@ -229,6 +229,9 @@ bool isVarArg, const SmallVectorImpl &ArgsFlags, LLVMContext &Context) const override; + bool shouldInsertFencesForAtomic(const Instruction *I) const override { + return true; + } }; } Index: lib/Target/XCore/XCoreISelLowering.cpp =================================================================== --- lib/Target/XCore/XCoreISelLowering.cpp +++ lib/Target/XCore/XCoreISelLowering.cpp @@ -156,7 +156,6 @@ // Atomic operations // We request a fence for ATOMIC_* instructions, to reduce them to Monotonic. // As we are always Sequential Consistent, an ATOMIC_FENCE becomes a no OP. - setInsertFencesForAtomic(true); setOperationAction(ISD::ATOMIC_FENCE, MVT::Other, Custom); setOperationAction(ISD::ATOMIC_LOAD, MVT::i32, Custom); setOperationAction(ISD::ATOMIC_STORE, MVT::i32, Custom); Index: test/Transforms/AtomicExpand/SPARC/libcalls.ll =================================================================== --- /dev/null +++ test/Transforms/AtomicExpand/SPARC/libcalls.ll @@ -0,0 +1,217 @@ +; RUN: opt -S %s -atomic-expand | FileCheck %s + +;;; NOTE: this test is actually target-independent -- any target which +;;; doesn't support inline atomics can be used. (E.g. X86 i386 would +;;; work, if LLVM is properly taught about what it's missing vs i586.) + +;target datalayout = "e-m:e-p:32:32-f64:32:64-f80:32-n8:16:32-S128" +;target triple = "i386-unknown-unknown" +target datalayout = "e-m:e-p:32:32-i64:64-f128:64-n32-S64" +target triple = "sparc-unknown-unknown" + +;; First, check the sized calls. Except for cmpxchg, these are fairly +;; straightforward. 
+ +; CHECK-LABEL: @test_load_i16( +; CHECK: %1 = bitcast i16* %arg to i8* +; CHECK: %2 = call i16 @__atomic_load_2(i8* %1, i32 5) +; CHECK: ret i16 %2 +define i16 @test_load_i16(i16* %arg) { + %ret = load atomic i16, i16* %arg seq_cst, align 4 + ret i16 %ret +} + +; CHECK-LABEL: @test_store_i16( +; CHECK: %1 = bitcast i16* %arg to i8* +; CHECK: call void @__atomic_store_2(i8* %1, i16 %val, i32 5) +; CHECK: ret void +define void @test_store_i16(i16* %arg, i16 %val) { + store atomic i16 %val, i16* %arg seq_cst, align 4 + ret void +} + +; CHECK-LABEL: @test_exchange_i16( +; CHECK: %1 = bitcast i16* %arg to i8* +; CHECK: %2 = call i16 @__atomic_exchange_2(i8* %1, i16 %val, i32 5) +; CHECK: ret i16 %2 +define i16 @test_exchange_i16(i16* %arg, i16 %val) { + %ret = atomicrmw xchg i16* %arg, i16 %val seq_cst + ret i16 %ret +} + +; CHECK-LABEL: @test_cmpxchg_i16( +; CHECK: %1 = bitcast i16* %arg to i8* +; CHECK: %2 = alloca i16, align 2 +; CHECK: %3 = bitcast i16* %2 to i8* +; CHECK: call void @llvm.lifetime.start(i64 2, i8* %3) +; CHECK: store i16 %old, i16* %2, align 2 +; CHECK: %4 = call zeroext i1 @__atomic_compare_exchange_2(i8* %1, i8* %3, i16 %new, i32 5, i32 0) +; CHECK: %5 = load i16, i16* %2, align 2 +; CHECK: call void @llvm.lifetime.end(i64 2, i8* %3) +; CHECK: %6 = insertvalue { i16, i1 } undef, i16 %5, 0 +; CHECK: %7 = insertvalue { i16, i1 } %6, i1 %4, 1 +; CHECK: %ret = extractvalue { i16, i1 } %7, 0 +; CHECK: ret i16 %ret +define i16 @test_cmpxchg_i16(i16* %arg, i16 %old, i16 %new) { + %ret_succ = cmpxchg i16* %arg, i16 %old, i16 %new seq_cst monotonic + %ret = extractvalue { i16, i1 } %ret_succ, 0 + ret i16 %ret +} + +; CHECK-LABEL: @test_add_i16( +; CHECK: %1 = bitcast i16* %arg to i8* +; CHECK: %2 = call i16 @__atomic_fetch_add_2(i8* %1, i16 %val, i32 5) +; CHECK: ret i16 %2 +define i16 @test_add_i16(i16* %arg, i16 %val) { + %ret = atomicrmw add i16* %arg, i16 %val seq_cst + ret i16 %ret +} + + +;; Now, check the output for the unsized libcalls. i128 is used for +;; these tests because the "16" suffixed functions aren't available on +;; 32-bit i386. 
+ +; CHECK-LABEL: @test_load_i128( +; CHECK: %1 = bitcast i128* %arg to i8* +; CHECK: %2 = alloca i128, align 16 +; CHECK: %3 = bitcast i128* %2 to i8* +; CHECK: call void @llvm.lifetime.start(i64 16, i8* %3) +; CHECK: call void @__atomic_load(i32 16, i8* %1, i8* %3, i32 5) +; CHECK: %4 = load i128, i128* %2, align 16 +; CHECK: call void @llvm.lifetime.end(i64 16, i8* %3) +; CHECK: ret i128 %4 +define i128 @test_load_i128(i128* %arg) { + %ret = load atomic i128, i128* %arg seq_cst, align 16 + ret i128 %ret +} + +; CHECK-LABEL @test_store_i128( +; CHECK: %1 = bitcast i128* %arg to i8* +; CHECK: %2 = alloca i128, align 16 +; CHECK: %3 = bitcast i128* %2 to i8* +; CHECK: call void @llvm.lifetime.start(i64 16, i8* %3) +; CHECK: store i128 %val, i128* %2, align 16 +; CHECK: call void @__atomic_store(i32 16, i8* %1, i8* %3, i32 5) +; CHECK: call void @llvm.lifetime.end(i64 16, i8* %3) +; CHECK: ret void +define void @test_store_i128(i128* %arg, i128 %val) { + store atomic i128 %val, i128* %arg seq_cst, align 16 + ret void +} + +; CHECK-LABEL: @test_exchange_i128( +; CHECK: %1 = bitcast i128* %arg to i8* +; CHECK: %2 = alloca i128, align 16 +; CHECK: %3 = bitcast i128* %2 to i8* +; CHECK: call void @llvm.lifetime.start(i64 16, i8* %3) +; CHECK: store i128 %val, i128* %2, align 16 +; CHECK: %4 = alloca i128, align 16 +; CHECK: %5 = bitcast i128* %4 to i8* +; CHECK: call void @llvm.lifetime.start(i64 16, i8* %5) +; CHECK: call void @__atomic_exchange(i32 16, i8* %1, i8* %3, i8* %5, i32 5) +; CHECK: call void @llvm.lifetime.end(i64 16, i8* %3) +; CHECK: %6 = load i128, i128* %4, align 16 +; CHECK: call void @llvm.lifetime.end(i64 16, i8* %5) +; CHECK: ret i128 %6 +define i128 @test_exchange_i128(i128* %arg, i128 %val) { + %ret = atomicrmw xchg i128* %arg, i128 %val seq_cst + ret i128 %ret +} + +; CHECK-LABEL: @test_cmpxchg_i128( +; CHECK: %1 = bitcast i128* %arg to i8* +; CHECK: %2 = alloca i128, align 16 +; CHECK: %3 = bitcast i128* %2 to i8* +; CHECK: call void @llvm.lifetime.start(i64 16, i8* %3) +; CHECK: store i128 %old, i128* %2, align 16 +; CHECK: %4 = alloca i128, align 16 +; CHECK: %5 = bitcast i128* %4 to i8* +; CHECK: call void @llvm.lifetime.start(i64 16, i8* %5) +; CHECK: store i128 %new, i128* %4, align 16 +; CHECK: %6 = call zeroext i1 @__atomic_compare_exchange(i32 16, i8* %1, i8* %3, i8* %5, i32 5, i32 0) +; CHECK: call void @llvm.lifetime.end(i64 16, i8* %5) +; CHECK: %7 = load i128, i128* %2, align 16 +; CHECK: call void @llvm.lifetime.end(i64 16, i8* %3) +; CHECK: %8 = insertvalue { i128, i1 } undef, i128 %7, 0 +; CHECK: %9 = insertvalue { i128, i1 } %8, i1 %6, 1 +; CHECK: %ret = extractvalue { i128, i1 } %9, 0 +; CHECK: ret i128 %ret +define i128 @test_cmpxchg_i128(i128* %arg, i128 %old, i128 %new) { + %ret_succ = cmpxchg i128* %arg, i128 %old, i128 %new seq_cst monotonic + %ret = extractvalue { i128, i1 } %ret_succ, 0 + ret i128 %ret +} + +; This one is a verbose expansion, as there is no generic +; __atomic_fetch_add function, so it needs to expand to a cmpxchg +; loop, which then itself expands into a libcall. 
+ +; CHECK-LABEL: @test_add_i128( +; CHECK: %1 = alloca i128, align 16 +; CHECK: %2 = alloca i128, align 16 +; CHECK: %3 = load i128, i128* %arg, align 16 +; CHECK: br label %atomicrmw.start +; CHECK:atomicrmw.start: +; CHECK: %loaded = phi i128 [ %3, %0 ], [ %newloaded, %atomicrmw.start ] +; CHECK: %new = add i128 %loaded, %val +; CHECK: %4 = bitcast i128* %arg to i8* +; CHECK: %5 = bitcast i128* %1 to i8* +; CHECK: call void @llvm.lifetime.start(i64 16, i8* %5) +; CHECK: store i128 %loaded, i128* %1, align 16 +; CHECK: %6 = bitcast i128* %2 to i8* +; CHECK: call void @llvm.lifetime.start(i64 16, i8* %6) +; CHECK: store i128 %new, i128* %2, align 16 +; CHECK: %7 = call zeroext i1 @__atomic_compare_exchange(i32 16, i8* %4, i8* %5, i8* %6, i32 5, i32 5) +; CHECK: call void @llvm.lifetime.end(i64 16, i8* %6) +; CHECK: %8 = load i128, i128* %1, align 16 +; CHECK: call void @llvm.lifetime.end(i64 16, i8* %5) +; CHECK: %9 = insertvalue { i128, i1 } undef, i128 %8, 0 +; CHECK: %10 = insertvalue { i128, i1 } %9, i1 %7, 1 +; CHECK: %success = extractvalue { i128, i1 } %10, 1 +; CHECK: %newloaded = extractvalue { i128, i1 } %10, 0 +; CHECK: br i1 %success, label %atomicrmw.end, label %atomicrmw.start +; CHECK:atomicrmw.end: +; CHECK: ret i128 %newloaded +define i128 @test_add_i128(i128* %arg, i128 %val) { + %ret = atomicrmw add i128* %arg, i128 %val seq_cst + ret i128 %ret +} + +;; Ensure that non-integer types get bitcast correctly on the way in and out of a libcall: + +; CHECK-LABEL: @test_load_double( +; CHECK: %1 = bitcast double* %arg to i8* +; CHECK: %2 = call i64 @__atomic_load_8(i8* %1, i32 5) +; CHECK: %3 = bitcast i64 %2 to double +; CHECK: ret double %3 +define double @test_load_double(double* %arg, double %val) { + %1 = load atomic double, double* %arg seq_cst, align 16 + ret double %1 +} + +; CHECK-LABEL: @test_store_double( +; CHECK: %1 = bitcast double* %arg to i8* +; CHECK: %2 = bitcast double %val to i64 +; CHECK: call void @__atomic_store_8(i8* %1, i64 %2, i32 5) +; CHECK: ret void +define void @test_store_double(double* %arg, double %val) { + store atomic double %val, double* %arg seq_cst, align 16 + ret void +} + +;; ...and for a non-integer type of large size too. + +; CHECK-LABEL: @test_store_fp128 +; CHECK: %1 = bitcast fp128* %arg to i8* +; CHECK: %2 = alloca fp128, align 16 +; CHECK: %3 = bitcast fp128* %2 to i8* +; CHECK: call void @llvm.lifetime.start(i64 16, i8* %3) +; CHECK: store fp128 %val, fp128* %2, align 16 +; CHECK: call void @__atomic_store(i32 16, i8* %1, i8* %3, i32 5) +; CHECK: call void @llvm.lifetime.end(i64 16, i8* %3) +; CHECK: ret void +define void @test_store_fp128(fp128* %arg, fp128 %val) { + store atomic fp128 %val, fp128* %arg seq_cst, align 16 + ret void +}