Index: docs/Atomics.rst
===================================================================
--- docs/Atomics.rst
+++ docs/Atomics.rst
@@ -417,14 +417,23 @@
 this is not correct in the IR sense of volatile, but CodeGen handles anything
 marked volatile very conservatively. This should get fixed at some point.
 
-Common architectures have some way of representing at least a pointer-sized
-lock-free ``cmpxchg``; such an operation can be used to implement all the other
-atomic operations which can be represented in IR up to that size. Backends are
-expected to implement all those operations, but not operations which cannot be
-implemented in a lock-free manner. It is expected that backends will give an
-error when given an operation which cannot be implemented. (The LLVM code
-generator is not very helpful here at the moment, but hopefully that will
-change.)
+One very important property of the atomic operations is that if your backend
+supports any inline lock-free atomic operations of a given size, you should
+support *ALL* operations of that size in a lock-free manner.
+
+When the target implements atomic cmpxchg or LL/SC instructions (as most do),
+this is trivial: all the other operations can be implemented on top of those
+primitives. However, on many older CPUs (e.g. ARMv5, SparcV8, Intel 80386) there
+are atomic load and store instructions, but no cmpxchg or LL/SC. As it is
+invalid to implement ``atomic load`` using the native instruction, but
+``cmpxchg`` using a library call to a function that uses a mutex, ``atomic
+load`` must *also* expand to a library call on such architectures, so that it
+can remain atomic with regard to a simultaneous ``cmpxchg``, by using the same
+mutex.
+
+AtomicExpandPass can help with that: it will expand atomic operations of any
+size above the maximum set by ``setMaxAtomicSizeSupported`` (which defaults to
+0) into the proper ``__atomic_*`` libcalls.
 
 On x86, all atomic loads generate a ``MOV``. SequentiallyConsistent stores
 generate an ``XCHG``, other stores generate a ``MOV``. SequentiallyConsistent
@@ -450,10 +459,149 @@
   ``emitStoreConditional()``
 * large loads/stores -> ll-sc/cmpxchg
   by overriding ``shouldExpandAtomicStoreInIR()``/``shouldExpandAtomicLoadInIR()``
-* strong atomic accesses -> monotonic accesses + fences
-  by using ``setInsertFencesForAtomic()`` and overriding ``emitLeadingFence()``
-  and ``emitTrailingFence()``
+* strong atomic accesses -> monotonic accesses + fences by overriding
+  ``shouldInsertFencesForAtomic()``, ``emitLeadingFence()``, and
+  ``emitTrailingFence()``
 * atomic rmw -> loop with cmpxchg or load-linked/store-conditional
   by overriding ``expandAtomicRMWInIR()``
+* expansion to __atomic_* libcalls for unsupported sizes (see the sketch below).
 
 For an example of all of these, look at the ARM backend.
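+
+As a rough sketch of how a backend opts into these hooks (the target name
+``Foo``, the ``FooSubtarget`` class, and its ``hasLockFreeAtomics()`` predicate
+are invented for illustration only; the ARM and SPARC changes in this patch are
+the authoritative examples)::
+
+  // FooISelLowering.cpp -- hypothetical backend configuration.
+  FooTargetLowering::FooTargetLowering(const TargetMachine &TM,
+                                       const FooSubtarget &STI)
+      : TargetLowering(TM), Subtarget(STI) {
+    // Only 32-bit and smaller accesses have inline lock-free lowerings on
+    // this imaginary target; anything wider is expanded by AtomicExpandPass
+    // into __atomic_* libcalls.
+    setMaxAtomicSizeSupported(Subtarget.hasLockFreeAtomics() ? 32 : 0);
+  }
+
+  // Weakly-ordered target: ask AtomicExpandPass to reduce orderings to
+  // monotonic and bracket each access with emitLeadingFence() /
+  // emitTrailingFence().
+  bool FooTargetLowering::shouldInsertFencesForAtomic(
+      const Instruction *I) const {
+    return true;
+  }
+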
+Libcalls: __atomic_*
+====================
+
+There are two kinds of atomic library calls that are generated by LLVM. Please
+note that both sets of library functions somewhat confusingly share the names of
+builtin functions defined by clang. Despite this, the library functions are
+not directly related to the builtins: it is *not* the case that ``__atomic_*``
+builtins lower to ``__atomic_*`` library calls and ``__sync_*`` builtins lower
+to ``__sync_*`` library calls.
+
+The first set of library functions are named ``__atomic_*``. This set has been
+"standardized" by GCC, and is described below. (See also `GCC's documentation`_.)
+
+LLVM's AtomicExpandPass will translate atomic operations on data sizes above
+``MaxAtomicSizeSupported`` into calls to these functions.
+
+There are four generic functions, which can be called with data of any size or
+alignment::
+
+  void __atomic_load(size_t size, void *ptr, void *ret, int ordering)
+  void __atomic_store(size_t size, void *ptr, void *val, int ordering)
+  void __atomic_exchange(size_t size, void *ptr, void *val, void *ret, int ordering)
+  bool __atomic_compare_exchange(size_t size, void *ptr, void *expected, void *desired, int success_order, int failure_order)
+
+There are also size-specialized versions of the above functions, which can only
+be used with *naturally-aligned* pointers of the appropriate size. In the
+signatures below, "N" is one of 1, 2, 4, 8, and 16, and "iN" is the appropriate
+integer type of that size; if no such integer type exists, the specialization
+cannot be used::
+
+  iN __atomic_load_N(iN *ptr, int ordering)
+  void __atomic_store_N(iN *ptr, iN val, int ordering)
+  iN __atomic_exchange_N(iN *ptr, iN val, int ordering)
+  bool __atomic_compare_exchange_N(iN *ptr, iN *expected, iN desired, int success_order, int failure_order)
+
+Finally there are some read-modify-write functions, which are only available in
+the size-specific variants (any other sizes use a ``__atomic_compare_exchange``
+loop)::
+
+  iN __atomic_fetch_add_N(iN *ptr, iN val, int ordering)
+  iN __atomic_fetch_sub_N(iN *ptr, iN val, int ordering)
+  iN __atomic_fetch_and_N(iN *ptr, iN val, int ordering)
+  iN __atomic_fetch_or_N(iN *ptr, iN val, int ordering)
+  iN __atomic_fetch_xor_N(iN *ptr, iN val, int ordering)
+  iN __atomic_fetch_nand_N(iN *ptr, iN val, int ordering)
+
+This set of library functions has some interesting implementation requirements
+to take note of:
+
+- They support all sizes and alignments -- including those which cannot be
+  implemented natively on any existing hardware. Therefore, they will certainly
+  use mutexes for some sizes/alignments.
+
+- As a consequence, they cannot be shipped in a statically linked
+  compiler-support library, as they have state which must be shared amongst all
+  DSOs loaded in the program. They must be provided in a shared library used by
+  all objects.
+
+- The set of atomic sizes supported lock-free must be a superset of the sizes
+  any compiler can emit. That is: if a new compiler introduces support for
+  inline-lock-free atomics of size N, the ``__atomic_*`` functions must also have
+  a lock-free implementation for size N. This is a requirement so that code
+  produced by an old compiler (which will have called the ``__atomic_*``
+  function) interoperates with code produced by the new compiler (which will use
+  the native atomic instructions).
+
+Note that it's possible to write an entirely target-independent implementation
+of these library functions by using the compiler atomic builtins themselves to
+implement the operations on naturally-aligned pointers of supported sizes, and a
+generic mutex implementation otherwise.
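+
+As a deliberately simplistic sketch of that approach, the generic
+``__atomic_load`` might look as follows, assuming a single global lock rather
+than the address-hashed lock table a production library would use (note that
+building such a file needs care, e.g. ``-fno-builtin``, since these names are
+also compiler builtins; everything here other than the ``__atomic_load``
+signature itself is illustrative)::
+
+  #include <cstddef>
+  #include <cstring>
+  #include <mutex>
+
+  // One lock shared by every locked path in the library, so that an
+  // oversized atomic load stays atomic with respect to an oversized
+  // cmpxchg on the same object.
+  static std::mutex AtomicLock;
+
+  extern "C" void __atomic_load(std::size_t size, void *ptr, void *ret,
+                                int /*ordering*/) {
+    // Generic entry point: only reached for sizes/alignments the compiler
+    // could not inline, so take the lock and copy.
+    std::lock_guard<std::mutex> Guard(AtomicLock);
+    std::memcpy(ret, ptr, size);
+  }
+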
+Libcalls: __sync_*
+==================
+
+Some targets or OS/target combinations can support lock-free atomics, but for
+various reasons, it is not practical to emit the instructions inline.
+
+There are two typical examples of this.
+
+Some CPUs support multiple instruction sets which can be switched back and forth
+on function-call boundaries. For example, MIPS supports the MIPS16 ISA, which
+has a smaller instruction encoding than the usual MIPS32 ISA. ARM, similarly,
+has the Thumb ISA. In MIPS16 and earlier versions of Thumb, the atomic
+instructions are not encodable. However, those instructions are available by
+calling a function compiled with the longer encoding.
+
+Additionally, a few OS/target pairs provide kernel-supported lock-free
+atomics. ARM/Linux is an example of this: the kernel provides a function which
+on older CPUs contains a "magically-restartable" atomic sequence (which looks
+atomic so long as there's only one CPU), and contains actual atomic instructions
+on newer multicore models. This sort of functionality can typically be provided
+on any architecture, if all CPUs which are missing cmpxchg support are
+uniprocessor (no SMP). This is almost always the case. The only common
+architecture without that property is SPARC -- SPARCV8 SMP systems were common,
+yet it doesn't support cmpxchg.
+
+In either of these cases, the Target in LLVM can claim support for atomics of an
+appropriate size, and then implement some subset of the operations via libcalls
+to a ``__sync_*`` function. Such functions *must* not use locks in their
+implementation, because unlike the ``__atomic_*`` routines used by
+AtomicExpandPass, these may be mixed-and-matched with native instructions by the
+target lowering.
+
+Further, these routines do not need to be shared, as they are stateless. So,
+there is no issue with having multiple copies included in one binary. Thus,
+typically these routines are implemented by the statically-linked compiler
+runtime support library.
+
+LLVM will emit a call to an appropriate ``__sync_*`` routine if the target
+ISelLowering code has set the corresponding ``ATOMIC_CMPXCHG``, ``ATOMIC_SWAP``,
+or ``ATOMIC_LOAD_*`` operation to "Expand", and if it has opted into the
+availability of those library functions via a call to ``initSyncLibcalls()``.
+
+The full set of functions that may be called by LLVM is (for ``N`` being 1, 2,
+4, 8, or 16)::
+
+  iN __sync_val_compare_and_swap_N(iN *ptr, iN expected, iN desired)
+  iN __sync_lock_test_and_set_N(iN *ptr, iN val)
+  iN __sync_fetch_and_add_N(iN *ptr, iN val)
+  iN __sync_fetch_and_sub_N(iN *ptr, iN val)
+  iN __sync_fetch_and_and_N(iN *ptr, iN val)
+  iN __sync_fetch_and_or_N(iN *ptr, iN val)
+  iN __sync_fetch_and_xor_N(iN *ptr, iN val)
+  iN __sync_fetch_and_nand_N(iN *ptr, iN val)
+  iN __sync_fetch_and_max_N(iN *ptr, iN val)
+  iN __sync_fetch_and_umax_N(iN *ptr, iN val)
+  iN __sync_fetch_and_min_N(iN *ptr, iN val)
+  iN __sync_fetch_and_umin_N(iN *ptr, iN val)
+
+This list doesn't include any function for atomic load or store; all known
+architectures support atomic loads and stores directly (possibly by emitting a
+fence on either side of a normal load or store).
+
+There's also, somewhat separately, the possibility to lower ``ATOMIC_FENCE`` to
+``__sync_synchronize()``. Whether this happens is independent of all of the
+above, and is controlled purely by ``setOperationAction(ISD::ATOMIC_FENCE, ...)``.
Index: include/llvm/CodeGen/RuntimeLibcalls.h
===================================================================
--- include/llvm/CodeGen/RuntimeLibcalls.h
+++ include/llvm/CodeGen/RuntimeLibcalls.h
@@ -398,6 +398,73 @@
     SYNC_FETCH_AND_UMIN_8,
     SYNC_FETCH_AND_UMIN_16,
 
+    // New style atomics.
+ ATOMIC_LOAD, + ATOMIC_LOAD_1, + ATOMIC_LOAD_2, + ATOMIC_LOAD_4, + ATOMIC_LOAD_8, + ATOMIC_LOAD_16, + + ATOMIC_STORE, + ATOMIC_STORE_1, + ATOMIC_STORE_2, + ATOMIC_STORE_4, + ATOMIC_STORE_8, + ATOMIC_STORE_16, + + ATOMIC_EXCHANGE, + ATOMIC_EXCHANGE_1, + ATOMIC_EXCHANGE_2, + ATOMIC_EXCHANGE_4, + ATOMIC_EXCHANGE_8, + ATOMIC_EXCHANGE_16, + + ATOMIC_COMPARE_EXCHANGE, + ATOMIC_COMPARE_EXCHANGE_1, + ATOMIC_COMPARE_EXCHANGE_2, + ATOMIC_COMPARE_EXCHANGE_4, + ATOMIC_COMPARE_EXCHANGE_8, + ATOMIC_COMPARE_EXCHANGE_16, + + ATOMIC_FETCH_ADD_1, + ATOMIC_FETCH_ADD_2, + ATOMIC_FETCH_ADD_4, + ATOMIC_FETCH_ADD_8, + ATOMIC_FETCH_ADD_16, + + ATOMIC_FETCH_SUB_1, + ATOMIC_FETCH_SUB_2, + ATOMIC_FETCH_SUB_4, + ATOMIC_FETCH_SUB_8, + ATOMIC_FETCH_SUB_16, + + ATOMIC_FETCH_AND_1, + ATOMIC_FETCH_AND_2, + ATOMIC_FETCH_AND_4, + ATOMIC_FETCH_AND_8, + ATOMIC_FETCH_AND_16, + + ATOMIC_FETCH_OR_1, + ATOMIC_FETCH_OR_2, + ATOMIC_FETCH_OR_4, + ATOMIC_FETCH_OR_8, + ATOMIC_FETCH_OR_16, + + ATOMIC_FETCH_XOR_1, + ATOMIC_FETCH_XOR_2, + ATOMIC_FETCH_XOR_4, + ATOMIC_FETCH_XOR_8, + ATOMIC_FETCH_XOR_16, + + ATOMIC_FETCH_NAND_1, + ATOMIC_FETCH_NAND_2, + ATOMIC_FETCH_NAND_4, + ATOMIC_FETCH_NAND_8, + ATOMIC_FETCH_NAND_16, + + ATOMIC_IS_LOCK_FREE, + // Stack Protector Fail. STACKPROTECTOR_CHECK_FAIL, @@ -430,7 +497,7 @@ /// Return the SYNC_FETCH_AND_* value for the given opcode and type, or /// UNKNOWN_LIBCALL if there is none. - Libcall getATOMIC(unsigned Opc, MVT VT); + Libcall getSYNC(unsigned Opc, MVT VT); } } Index: include/llvm/Target/TargetLowering.h =================================================================== --- include/llvm/Target/TargetLowering.h +++ include/llvm/Target/TargetLowering.h @@ -1003,12 +1003,6 @@ return PrefLoopAlignment; } - /// Return whether the DAG builder should automatically insert fences and - /// reduce ordering for atomics. - bool getInsertFencesForAtomic() const { - return InsertFencesForAtomic; - } - /// Return true if the target stores stack protector cookies at a fixed offset /// in some non-standard address space, and populates the address space and /// offset as appropriate. @@ -1052,6 +1046,19 @@ /// \name Helpers for atomic expansion. /// @{ + /// Returns the maximum atomic operation size (in bits) supported by + /// the backend. Atomic operations greater than this size (as well + /// as ones that are not naturally aligned), will be expanded by + /// AtomicExpandPass into an __atomic_* library call. + unsigned getMaxAtomicSizeSupported() const { return MaxAtomicSizeSupported; } + + /// Whether the DAG builder should automatically insert fences and + /// reduce ordering for this atomic. This should be true for + /// most architectures with weak memory ordering. Defaults to false. + virtual bool shouldInsertFencesForAtomic(const Instruction *I) const { + return false; + } + /// Perform a load-linked operation on Addr, returning a "Value *" with the /// corresponding pointee type. This may entail some non-trivial operations to /// truncate or reconstruct types that will be illegal in the backend. See @@ -1070,12 +1077,12 @@ /// Inserts in the IR a target-specific intrinsic specifying a fence. /// It is called by AtomicExpandPass before expanding an - /// AtomicRMW/AtomicCmpXchg/AtomicStore/AtomicLoad. + /// AtomicRMW/AtomicCmpXchg/AtomicStore/AtomicLoad + /// if shouldInsertFencesForAtomic returns true. /// RMW and CmpXchg set both IsStore and IsLoad to true. /// This function should either return a nullptr, or a pointer to an IR-level /// Instruction*. 
Even complex fence sequences can be represented by a /// single Instruction* through an intrinsic to be lowered later. - /// Backends with !getInsertFencesForAtomic() should keep a no-op here. /// Backends should override this method to produce target-specific intrinsic /// for their fences. /// FIXME: Please note that the default implementation here in terms of @@ -1101,9 +1108,6 @@ virtual Instruction *emitLeadingFence(IRBuilder<> &Builder, AtomicOrdering Ord, bool IsStore, bool IsLoad) const { - if (!getInsertFencesForAtomic()) - return nullptr; - if (isAtLeastRelease(Ord) && IsStore) return Builder.CreateFence(Ord); else @@ -1113,9 +1117,6 @@ virtual Instruction *emitTrailingFence(IRBuilder<> &Builder, AtomicOrdering Ord, bool IsStore, bool IsLoad) const { - if (!getInsertFencesForAtomic()) - return nullptr; - if (isAtLeastAcquire(Ord)) return Builder.CreateFence(Ord); else @@ -1441,10 +1442,12 @@ MinStackArgumentAlignment = Align; } - /// Set if the DAG builder should automatically insert fences and reduce the - /// order of atomic memory operations to Monotonic. - void setInsertFencesForAtomic(bool fence) { - InsertFencesForAtomic = fence; + /// Set the maximum atomic operation size supported by the + /// backend. Atomic operations greater than this size (as well as + /// ones that are not naturally aligned), will be expanded by + /// AtomicExpandPass into an __atomic_* library call. + void setMaxAtomicSizeSupported(unsigned Size) { + MaxAtomicSizeSupported = Size; } public: @@ -1856,10 +1859,9 @@ /// The preferred loop alignment. unsigned PrefLoopAlignment; - /// Whether the DAG builder should automatically insert fences and reduce - /// ordering for atomics. (This will be set for for most architectures with - /// weak memory ordering.) - bool InsertFencesForAtomic; + /// Size in bits of the maximum atomics size the backend supports. + /// Accesses larger than this will be expanded by AtomicExpandPass. + unsigned MaxAtomicSizeSupported; /// If set to a physical register, this specifies the register that /// llvm.savestack/llvm.restorestack should save and restore. Index: lib/CodeGen/AtomicExpandPass.cpp =================================================================== --- lib/CodeGen/AtomicExpandPass.cpp +++ lib/CodeGen/AtomicExpandPass.cpp @@ -8,10 +8,10 @@ //===----------------------------------------------------------------------===// // // This file contains a pass (at IR level) to replace atomic instructions with -// target specific instruction which implement the same semantics in a way -// which better fits the target backend. This can include the use of either -// (intrinsic-based) load-linked/store-conditional loops, AtomicCmpXchg, or -// type coercions. +// __atomic_* library calls, or target specific instruction which implement the +// same semantics in a way which better fits the target backend. This can +// include the use of (intrinsic-based) load-linked/store-conditional loops, +// AtomicCmpXchg, or type coercions. 
// //===----------------------------------------------------------------------===// @@ -64,19 +64,93 @@ bool expandAtomicCmpXchg(AtomicCmpXchgInst *CI); bool isIdempotentRMW(AtomicRMWInst *AI); bool simplifyIdempotentRMW(AtomicRMWInst *AI); + + bool expandAtomicOpToLibcall(Instruction *I, unsigned Size, unsigned Align, + Value *PointerOperand, Value *ValueOperand, + Value *CASExpected, AtomicOrdering Ordering, + AtomicOrdering Ordering2, + const RTLIB::Libcall *Libcalls); + void expandAtomicLoadToLibcall(LoadInst *LI); + void expandAtomicStoreToLibcall(StoreInst *LI); + void expandAtomicRMWToLibcall(AtomicRMWInst *I); + void expandAtomicCASToLibcall(AtomicCmpXchgInst *I); }; } char AtomicExpand::ID = 0; char &llvm::AtomicExpandID = AtomicExpand::ID; -INITIALIZE_TM_PASS(AtomicExpand, "atomic-expand", - "Expand Atomic calls in terms of either load-linked & store-conditional or cmpxchg", - false, false) +INITIALIZE_TM_PASS(AtomicExpand, "atomic-expand", "Expand Atomic instructions", + false, false) FunctionPass *llvm::createAtomicExpandPass(const TargetMachine *TM) { return new AtomicExpand(TM); } +namespace { +// Helper functions to retrieve the size of atomic instructions. +unsigned getAtomicOpSize(LoadInst *LI) { + const DataLayout &DL = LI->getModule()->getDataLayout(); + return DL.getTypeStoreSize(LI->getType()); +} + +unsigned getAtomicOpSize(StoreInst *SI) { + const DataLayout &DL = SI->getModule()->getDataLayout(); + return DL.getTypeStoreSize(SI->getValueOperand()->getType()); +} + +unsigned getAtomicOpSize(AtomicRMWInst *RMWI) { + const DataLayout &DL = RMWI->getModule()->getDataLayout(); + return DL.getTypeStoreSize(RMWI->getValOperand()->getType()); +} + +unsigned getAtomicOpSize(AtomicCmpXchgInst *CASI) { + const DataLayout &DL = CASI->getModule()->getDataLayout(); + return DL.getTypeStoreSize(CASI->getCompareOperand()->getType()); +} + +// Helper functions to retrieve the alignment of atomic instructions. +unsigned getAtomicOpAlign(LoadInst *LI) { + const DataLayout &DL = LI->getModule()->getDataLayout(); + unsigned Align = LI->getAlignment(); + if (Align == 0) + return DL.getABITypeAlignment(LI->getType()); + return Align; +} + +unsigned getAtomicOpAlign(StoreInst *SI) { + const DataLayout &DL = SI->getModule()->getDataLayout(); + unsigned Align = SI->getAlignment(); + if (Align == 0) + return DL.getABITypeAlignment(SI->getValueOperand()->getType()); + return Align; +} + +unsigned getAtomicOpAlign(AtomicRMWInst *RMWI) { + // TODO: This instruction has no alignment attribute, but unlike the + // default alignment for load/store, the default here is to assume + // it has NATURAL alignment, not DataLayout-specified alignment. + const DataLayout &DL = RMWI->getModule()->getDataLayout(); + return DL.getTypeStoreSize(RMWI->getValOperand()->getType()); +} + +unsigned getAtomicOpAlign(AtomicCmpXchgInst *CASI) { + // TODO: same comment as above. + const DataLayout &DL = CASI->getModule()->getDataLayout(); + return DL.getTypeStoreSize(CASI->getCompareOperand()->getType()); +} + +// Determine if a particular atomic operation has a supported size, +// and is of appropriate alignment, to be passed through for target +// lowering. 
(Versus turning into a __atomic libcall) +template +bool atomicSizeSupported(const TargetLowering *TLI, Inst *I) { + unsigned Size = getAtomicOpSize(I); + unsigned Align = getAtomicOpAlign(I); + return Align >= Size && Size <= TLI->getMaxAtomicSizeSupported() / 8; +} + +} // end anonymous namespace + bool AtomicExpand::runOnFunction(Function &F) { if (!TM || !TM->getSubtargetImpl(F)->enableAtomicExpand()) return false; @@ -93,14 +167,43 @@ bool MadeChange = false; for (auto I : AtomicInsts) { + if (isa(I)) + continue; + auto LI = dyn_cast(I); auto SI = dyn_cast(I); auto RMWI = dyn_cast(I); auto CASI = dyn_cast(I); - assert((LI || SI || RMWI || CASI || isa(I)) && - "Unknown atomic instruction"); + assert((LI || SI || RMWI || CASI) && "Unknown atomic instruction"); - if (TLI->getInsertFencesForAtomic()) { + // If the Size/Alignment is not supported, replace with a libcall. + if (LI) { + if (!atomicSizeSupported(TLI, LI)) { + expandAtomicLoadToLibcall(LI); + MadeChange = true; + continue; + } + } else if (SI) { + if (!atomicSizeSupported(TLI, SI)) { + expandAtomicStoreToLibcall(SI); + MadeChange = true; + continue; + } + } else if (RMWI) { + if (!atomicSizeSupported(TLI, RMWI)) { + expandAtomicRMWToLibcall(RMWI); + MadeChange = true; + continue; + } + } else if (CASI) { + if (!atomicSizeSupported(TLI, CASI)) { + expandAtomicCASToLibcall(CASI); + MadeChange = true; + continue; + } + } + + if (TLI->shouldInsertFencesForAtomic(I)) { auto FenceOrdering = Monotonic; bool IsStore, IsLoad; if (LI && isAtLeastAcquire(LI->getOrdering())) { @@ -144,7 +247,7 @@ assert(LI->getType()->isIntegerTy() && "invariant broken"); MadeChange = true; } - + MadeChange |= tryExpandAtomicLoad(LI); } else if (SI) { if (SI->getValueOperand()->getType()->isFloatingPointTy()) { @@ -514,12 +617,14 @@ BasicBlock *BB = CI->getParent(); Function *F = BB->getParent(); LLVMContext &Ctx = F->getContext(); - // If getInsertFencesForAtomic() returns true, then the target does not want + // If shouldInsertFencesForAtomic() returns true, then the target does not + // want // to deal with memory orders, and emitLeading/TrailingFence should take care // of everything. Otherwise, emitLeading/TrailingFence are no-op and we // should preserve the ordering. + bool ShouldInsertFencesForAtomic = TLI->shouldInsertFencesForAtomic(CI); AtomicOrdering MemOpOrder = - TLI->getInsertFencesForAtomic() ? Monotonic : SuccessOrder; + ShouldInsertFencesForAtomic ? Monotonic : SuccessOrder; // In implementations which use a barrier to achieve release semantics, we can // delay emitting this barrier until we know a store is actually going to be @@ -530,7 +635,7 @@ // since in other cases the extra blocks naturally collapse down to the // minimal loop. Unfortunately, this puts too much stress on later // optimisations so we avoid emitting the extra logic in those cases too. - bool HasReleasedLoadBB = !CI->isWeak() && TLI->getInsertFencesForAtomic() && + bool HasReleasedLoadBB = !CI->isWeak() && ShouldInsertFencesForAtomic && SuccessOrder != Monotonic && SuccessOrder != Acquire && !F->optForMinSize(); @@ -601,7 +706,7 @@ // the branch entirely. 
std::prev(BB->end())->eraseFromParent(); Builder.SetInsertPoint(BB); - if (UseUnconditionalReleaseBarrier) + if (ShouldInsertFencesForAtomic && UseUnconditionalReleaseBarrier) TLI->emitLeadingFence(Builder, SuccessOrder, /*IsStore=*/true, /*IsLoad=*/true); Builder.CreateBr(StartBB); @@ -617,7 +722,7 @@ Builder.CreateCondBr(ShouldStore, ReleasingStoreBB, NoStoreBB); Builder.SetInsertPoint(ReleasingStoreBB); - if (!UseUnconditionalReleaseBarrier) + if (ShouldInsertFencesForAtomic && !UseUnconditionalReleaseBarrier) TLI->emitLeadingFence(Builder, SuccessOrder, /*IsStore=*/true, /*IsLoad=*/true); Builder.CreateBr(TryStoreBB); @@ -647,8 +752,9 @@ // Make sure later instructions don't get reordered with a fence if // necessary. Builder.SetInsertPoint(SuccessBB); - TLI->emitTrailingFence(Builder, SuccessOrder, /*IsStore=*/true, - /*IsLoad=*/true); + if (ShouldInsertFencesForAtomic) + TLI->emitTrailingFence(Builder, SuccessOrder, /*IsStore=*/true, + /*IsLoad=*/true); Builder.CreateBr(ExitBB); Builder.SetInsertPoint(NoStoreBB); @@ -659,8 +765,9 @@ Builder.CreateBr(FailureBB); Builder.SetInsertPoint(FailureBB); - TLI->emitTrailingFence(Builder, FailureOrder, /*IsStore=*/true, - /*IsLoad=*/true); + if (ShouldInsertFencesForAtomic) + TLI->emitTrailingFence(Builder, FailureOrder, /*IsStore=*/true, + /*IsLoad=*/true); Builder.CreateBr(ExitBB); // Finally, we have control-flow based knowledge of whether the cmpxchg @@ -828,3 +935,385 @@ return true; } + +// This converts from LLVM's internal AtomicOrdering enum to the +// memory_order_* value required by the __atomic_* libcalls. +static int libcallAtomicModel(AtomicOrdering AO) { + switch (AO) { + case NotAtomic: + llvm_unreachable("Expected atomic memory order."); + case Unordered: + case Monotonic: + return 0; // memory_order_relaxed + // Not implemented yet in llvm: + // case Consume: + // return 1; // memory_order_consume + case Acquire: + return 2; // memory_order_acquire + case Release: + return 3; // memory_order_release + case AcquireRelease: + return 4; // memory_order_acq_rel + case SequentiallyConsistent: + return 5; // memory_order_seq_cst + } + llvm_unreachable("Unknown atomic memory order."); +} + +// In order to use one of the sized library calls such as +// __atomic_fetch_add_4, the alignment must be sufficient, the size +// must be one of the potentially-specialized sizes, and the value +// type must actually exist in C on the target (otherwise, the +// function wouldn't actually be defined.) +static bool canUseSizedAtomicCall(unsigned Size, unsigned Align, + const DataLayout &DL) { + // TODO: "LargestSize" is an approximation for "largest type that + // you can express in C". It seems to be the case that int128 is + // supported on all 64-bit platforms, otherwise only up to 64-bit + // integers are supported. If we get this wrong, then we'll try to + // call a sized libcall that doesn't actually exist. There should + // really be some more reliable way in LLVM of determining integer + // sizes which are valid in the target's C ABI... + unsigned LargestSize = DL.getLargestLegalIntTypeSize() >= 64 ? 
16 : 8; + return Align >= Size && + (Size == 1 || Size == 2 || Size == 4 || Size == 8 || Size == 16) && + Size <= LargestSize; +} + +void AtomicExpand::expandAtomicLoadToLibcall(LoadInst *I) { + static const RTLIB::Libcall Libcalls[6] = { + RTLIB::ATOMIC_LOAD, RTLIB::ATOMIC_LOAD_1, RTLIB::ATOMIC_LOAD_2, + RTLIB::ATOMIC_LOAD_4, RTLIB::ATOMIC_LOAD_8, RTLIB::ATOMIC_LOAD_16}; + unsigned Size = getAtomicOpSize(I); + unsigned Align = getAtomicOpAlign(I); + + if (!expandAtomicOpToLibcall(I, Size, Align, I->getPointerOperand(), nullptr, + nullptr, I->getOrdering(), + AtomicOrdering::NotAtomic, Libcalls)) + llvm_unreachable("expandAtomicOpToLibcall shouldn't fail tor Load"); +} + +void AtomicExpand::expandAtomicStoreToLibcall(StoreInst *I) { + static const RTLIB::Libcall Libcalls[6] = { + RTLIB::ATOMIC_STORE, RTLIB::ATOMIC_STORE_1, RTLIB::ATOMIC_STORE_2, + RTLIB::ATOMIC_STORE_4, RTLIB::ATOMIC_STORE_8, RTLIB::ATOMIC_STORE_16}; + unsigned Size = getAtomicOpSize(I); + unsigned Align = getAtomicOpAlign(I); + + if (!expandAtomicOpToLibcall(I, Size, Align, I->getPointerOperand(), + I->getValueOperand(), nullptr, I->getOrdering(), + AtomicOrdering::NotAtomic, Libcalls)) + llvm_unreachable("expandAtomicOpToLibcall shouldn't fail tor Store"); +} + +void AtomicExpand::expandAtomicCASToLibcall(AtomicCmpXchgInst *I) { + static const RTLIB::Libcall Libcalls[6] = { + RTLIB::ATOMIC_COMPARE_EXCHANGE, RTLIB::ATOMIC_COMPARE_EXCHANGE_1, + RTLIB::ATOMIC_COMPARE_EXCHANGE_2, RTLIB::ATOMIC_COMPARE_EXCHANGE_4, + RTLIB::ATOMIC_COMPARE_EXCHANGE_8, RTLIB::ATOMIC_COMPARE_EXCHANGE_16}; + unsigned Size = getAtomicOpSize(I); + unsigned Align = getAtomicOpAlign(I); + + if (!expandAtomicOpToLibcall(I, Size, Align, I->getPointerOperand(), + I->getNewValOperand(), I->getCompareOperand(), + I->getSuccessOrdering(), I->getFailureOrdering(), + Libcalls)) + llvm_unreachable("expandAtomicOpToLibcall shouldn't fail tor CAS"); +} + +void AtomicExpand::expandAtomicRMWToLibcall(AtomicRMWInst *I) { + static const RTLIB::Libcall LibcallsXchg[6] = { + RTLIB::ATOMIC_EXCHANGE, RTLIB::ATOMIC_EXCHANGE_1, + RTLIB::ATOMIC_EXCHANGE_2, RTLIB::ATOMIC_EXCHANGE_4, + RTLIB::ATOMIC_EXCHANGE_8, RTLIB::ATOMIC_EXCHANGE_16}; + static const RTLIB::Libcall LibcallsAdd[6] = { + RTLIB::UNKNOWN_LIBCALL, RTLIB::ATOMIC_FETCH_ADD_1, + RTLIB::ATOMIC_FETCH_ADD_2, RTLIB::ATOMIC_FETCH_ADD_4, + RTLIB::ATOMIC_FETCH_ADD_8, RTLIB::ATOMIC_FETCH_ADD_16}; + static const RTLIB::Libcall LibcallsSub[6] = { + RTLIB::UNKNOWN_LIBCALL, RTLIB::ATOMIC_FETCH_SUB_1, + RTLIB::ATOMIC_FETCH_SUB_2, RTLIB::ATOMIC_FETCH_SUB_4, + RTLIB::ATOMIC_FETCH_SUB_8, RTLIB::ATOMIC_FETCH_SUB_16}; + static const RTLIB::Libcall LibcallsAnd[6] = { + RTLIB::UNKNOWN_LIBCALL, RTLIB::ATOMIC_FETCH_AND_1, + RTLIB::ATOMIC_FETCH_AND_2, RTLIB::ATOMIC_FETCH_AND_4, + RTLIB::ATOMIC_FETCH_AND_8, RTLIB::ATOMIC_FETCH_AND_16}; + static const RTLIB::Libcall LibcallsOr[6] = { + RTLIB::UNKNOWN_LIBCALL, RTLIB::ATOMIC_FETCH_OR_1, + RTLIB::ATOMIC_FETCH_OR_2, RTLIB::ATOMIC_FETCH_OR_4, + RTLIB::ATOMIC_FETCH_OR_8, RTLIB::ATOMIC_FETCH_OR_16}; + static const RTLIB::Libcall LibcallsXor[6] = { + RTLIB::UNKNOWN_LIBCALL, RTLIB::ATOMIC_FETCH_XOR_1, + RTLIB::ATOMIC_FETCH_XOR_2, RTLIB::ATOMIC_FETCH_XOR_4, + RTLIB::ATOMIC_FETCH_XOR_8, RTLIB::ATOMIC_FETCH_XOR_16}; + static const RTLIB::Libcall LibcallsNand[6] = { + RTLIB::UNKNOWN_LIBCALL, RTLIB::ATOMIC_FETCH_NAND_1, + RTLIB::ATOMIC_FETCH_NAND_2, RTLIB::ATOMIC_FETCH_NAND_4, + RTLIB::ATOMIC_FETCH_NAND_8, RTLIB::ATOMIC_FETCH_NAND_16}; + + const RTLIB::Libcall *Libcalls; + switch (I->getOperation()) { 
+ case AtomicRMWInst::Xchg: + Libcalls = LibcallsXchg; + break; + case AtomicRMWInst::Add: + Libcalls = LibcallsAdd; + break; + case AtomicRMWInst::Sub: + Libcalls = LibcallsSub; + break; + case AtomicRMWInst::And: + Libcalls = LibcallsAnd; + break; + case AtomicRMWInst::Or: + Libcalls = LibcallsOr; + break; + case AtomicRMWInst::Xor: + Libcalls = LibcallsXor; + break; + case AtomicRMWInst::Nand: + Libcalls = LibcallsNand; + break; + case AtomicRMWInst::Max: + case AtomicRMWInst::Min: + case AtomicRMWInst::UMax: + case AtomicRMWInst::UMin: + // No atomic libcalls are available for max/min/umax/umin. + Libcalls = nullptr; + break; + default: + llvm_unreachable("Unexpected RMW operation"); + } + + unsigned Size = getAtomicOpSize(I); + unsigned Align = getAtomicOpAlign(I); + + bool Success = Libcalls && expandAtomicOpToLibcall( + I, Size, Align, I->getPointerOperand(), + I->getValOperand(), nullptr, I->getOrdering(), + AtomicOrdering::NotAtomic, Libcalls); + + // The expansion failed: either there were no libcalls at all for + // the operation (min/max), or there were only size-specialized + // libcalls (add/sub/etc) and we needed a generic. So, expand to a + // CAS loop instead. + if (!Success) { + expandAtomicRMWToCmpXchg(I, [this](IRBuilder<> &Builder, Value *Addr, + Value *Loaded, Value *NewVal, + AtomicOrdering MemOpOrder, + Value *&Success, Value *&NewLoaded) { + // Create the CAS instruction normally... + AtomicCmpXchgInst *Pair = Builder.CreateAtomicCmpXchg( + Addr, Loaded, NewVal, MemOpOrder, + AtomicCmpXchgInst::getStrongestFailureOrdering(MemOpOrder)); + Success = Builder.CreateExtractValue(Pair, 1, "success"); + NewLoaded = Builder.CreateExtractValue(Pair, 0, "newloaded"); + + // ...and then expand the CAS into a libcall. + expandAtomicCASToLibcall(Pair); + }); + } +} + +// A helper routine for the above expandAtomic*ToLibcall functions. +// +// 'Libcalls' contains an array of enum values for the particular +// ATOMIC libcalls to be emitted. All of the other arguments besides +// 'I' are extracted from the Instruction subclass by the +// caller. Depending on the particular call, some will be null. +bool AtomicExpand::expandAtomicOpToLibcall( + Instruction *I, unsigned Size, unsigned Align, Value *PointerOperand, + Value *ValueOperand, Value *CASExpected, AtomicOrdering Ordering, + AtomicOrdering Ordering2, const RTLIB::Libcall *Libcalls) { + LLVMContext &Ctx = I->getContext(); + Module *M = I->getModule(); + const DataLayout &DL = M->getDataLayout(); + IRBuilder<> Builder(I); + IRBuilder<> AllocaBuilder(&I->getFunction()->getEntryBlock().front()); + + unsigned AllocaAlignment = std::min(Size, 16u); + bool UseSizedLibcall = canUseSizedAtomicCall(Size, Align, DL); + + Type *SizedIntTy = Type::getIntNTy(Ctx, Size * 8); + + // TODO: the "order" argument type is "int", not int32. So + // getInt32Ty may be wrong if the arch uses e.g. 16-bit ints. + ConstantInt *SizeVal64 = ConstantInt::get(Type::getInt64Ty(Ctx), Size); + Constant *OrderingVal = + ConstantInt::get(Type::getInt32Ty(Ctx), libcallAtomicModel(Ordering)); + Constant *Ordering2Val = CASExpected + ? 
ConstantInt::get(Type::getInt32Ty(Ctx), + libcallAtomicModel(Ordering2)) + : nullptr; + bool HasResult = I->getType() != Type::getVoidTy(Ctx); + + RTLIB::Libcall RTLibType; + if (UseSizedLibcall) { + switch (Size) { + case 1: + RTLibType = Libcalls[1]; + break; + case 2: + RTLibType = Libcalls[2]; + break; + case 4: + RTLibType = Libcalls[3]; + break; + case 8: + RTLibType = Libcalls[4]; + break; + case 16: + RTLibType = Libcalls[5]; + break; + } + } else if (Libcalls[0] != RTLIB::UNKNOWN_LIBCALL) { + RTLibType = Libcalls[0]; + } else { + // Can't use sized function, and there's no generic for this + // operation, so give up. + return false; + } + + // Build up the function call. There's two kinds. First, the sized + // variants. These calls are going to be one of the following (with + // N=1,2,4,8,16): + // iN __atomic_load_N(iN *ptr, int ordering) + // void __atomic_store_N(iN *ptr, iN val, int ordering) + // iN __atomic_{exchange|fetch_*}_N(iN *ptr, iN val, int ordering) + // bool __atomic_compare_exchange_N(iN *ptr, iN *expected, iN desired, + // int success_order, int failure_order) + // + // Note that these functions can be used for non-integer atomic + // operations, the values just need to be bitcast to integers on the + // way in and out. + // + // And, then, the generic variants. They look like the following: + // void __atomic_load(size_t size, void *ptr, void *ret, int ordering) + // void __atomic_store(size_t size, void *ptr, void *val, int ordering) + // void __atomic_exchange(size_t size, void *ptr, void *val, void *ret, + // int ordering) + // bool __atomic_compare_exchange(size_t size, void *ptr, void *expected, + // void *desired, int success_order, + // int failure_order) + // + // The different signatures are built up depending on the + // 'UseSizedLibcall', 'CASExpected', 'ValueOperand', and 'HasResult' + // variables. + + AllocaInst *AllocaCASExpected = nullptr; + Value *AllocaCASExpected_i8 = nullptr; + AllocaInst *AllocaValue = nullptr; + Value *AllocaValue_i8 = nullptr; + AllocaInst *AllocaResult = nullptr; + Value *AllocaResult_i8 = nullptr; + + Type *ResultTy; + SmallVector Args; + AttributeSet Attr; + + // 'size' argument. + if (!UseSizedLibcall) { + // Note, getIntPtrType is assumed equivalent to size_t. + Args.push_back(ConstantInt::get(DL.getIntPtrType(Ctx), Size)); + } + + // 'ptr' argument. + Value *PtrVal = + Builder.CreateBitCast(PointerOperand, Type::getInt8PtrTy(Ctx)); + Args.push_back(PtrVal); + + // 'expected' argument, if present. + if (CASExpected) { + AllocaCASExpected = + AllocaBuilder.CreateAlloca(CASExpected->getType(), nullptr, ""); + AllocaCASExpected->setAlignment(AllocaAlignment); + AllocaCASExpected_i8 = + Builder.CreateBitCast(AllocaCASExpected, Type::getInt8PtrTy(Ctx)); + Builder.CreateLifetimeStart(AllocaCASExpected_i8, SizeVal64); + Builder.CreateAlignedStore(CASExpected, AllocaCASExpected, AllocaAlignment); + Args.push_back(AllocaCASExpected_i8); + } + + // 'val' argument ('desired' for cas), if present. 
+ if (ValueOperand) { + if (UseSizedLibcall) { + Value *IntValue = + Builder.CreateBitOrPointerCast(ValueOperand, SizedIntTy); + Args.push_back(IntValue); + } else { + AllocaValue = + AllocaBuilder.CreateAlloca(ValueOperand->getType(), nullptr, ""); + AllocaValue->setAlignment(AllocaAlignment); + AllocaValue_i8 = + Builder.CreateBitCast(AllocaValue, Type::getInt8PtrTy(Ctx)); + Builder.CreateLifetimeStart(AllocaValue_i8, SizeVal64); + Builder.CreateAlignedStore(ValueOperand, AllocaValue, AllocaAlignment); + Args.push_back(AllocaValue_i8); + } + } + + // 'ret' argument. + if (!CASExpected && HasResult && !UseSizedLibcall) { + AllocaResult = AllocaBuilder.CreateAlloca(I->getType(), nullptr, ""); + AllocaResult->setAlignment(AllocaAlignment); + AllocaResult_i8 = + Builder.CreateBitCast(AllocaResult, Type::getInt8PtrTy(Ctx)); + Builder.CreateLifetimeStart(AllocaResult_i8, SizeVal64); + Args.push_back(AllocaResult_i8); + } + + // 'ordering' ('success_order' for cas) argument. + Args.push_back(OrderingVal); + + // 'failure_order' argument, if present. + if (Ordering2Val) + Args.push_back(Ordering2Val); + + // Now, the return type. + if (CASExpected) { + ResultTy = Type::getInt1Ty(Ctx); + Attr = Attr.addAttribute(Ctx, AttributeSet::ReturnIndex, Attribute::ZExt); + } else if (HasResult && UseSizedLibcall) + ResultTy = SizedIntTy; + else + ResultTy = Type::getVoidTy(Ctx); + + // Done with setting up arguments and return types, create the call: + SmallVector ArgTys; + for (Value *Arg : Args) + ArgTys.push_back(Arg->getType()); + FunctionType *FnType = FunctionType::get(ResultTy, ArgTys, false); + Constant *LibcallFn = + M->getOrInsertFunction(TLI->getLibcallName(RTLibType), FnType, Attr); + CallInst *Call = Builder.CreateCall(LibcallFn, Args); + Call->setAttributes(Attr); + Value *Result = Call; + + // And then, extract the results... 
+ if (ValueOperand && !UseSizedLibcall) + Builder.CreateLifetimeEnd(AllocaValue_i8, SizeVal64); + + if (CASExpected) { + // The final result from the CAS is {load of 'expected' alloca, bool result + // from call} + Type *FinalResultTy = I->getType(); + Value *V = UndefValue::get(FinalResultTy); + Value *ExpectedOut = + Builder.CreateAlignedLoad(AllocaCASExpected, AllocaAlignment); + Builder.CreateLifetimeEnd(AllocaCASExpected_i8, SizeVal64); + V = Builder.CreateInsertValue(V, ExpectedOut, 0, ""); + V = Builder.CreateInsertValue(V, Result, 1, ""); + I->replaceAllUsesWith(V); + } else if (HasResult) { + Value *V; + if (UseSizedLibcall) + V = Builder.CreateBitOrPointerCast(Result, I->getType()); + else { + V = Builder.CreateAlignedLoad(AllocaResult, AllocaAlignment); + Builder.CreateLifetimeEnd(AllocaResult_i8, SizeVal64); + } + I->replaceAllUsesWith(V); + } + I->eraseFromParent(); + return true; +} Index: lib/CodeGen/SelectionDAG/LegalizeDAG.cpp =================================================================== --- lib/CodeGen/SelectionDAG/LegalizeDAG.cpp +++ lib/CodeGen/SelectionDAG/LegalizeDAG.cpp @@ -4035,7 +4035,7 @@ case ISD::ATOMIC_LOAD_UMAX: case ISD::ATOMIC_CMP_SWAP: { MVT VT = cast(Node)->getMemoryVT().getSimpleVT(); - RTLIB::Libcall LC = RTLIB::getATOMIC(Opc, VT); + RTLIB::Libcall LC = RTLIB::getSYNC(Opc, VT); assert(LC != RTLIB::UNKNOWN_LIBCALL && "Unexpected atomic op or value type!"); std::pair Tmp = ExpandChainLibCall(LC, Node, false); Index: lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp =================================================================== --- lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp +++ lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp @@ -1404,7 +1404,7 @@ std::pair DAGTypeLegalizer::ExpandAtomic(SDNode *Node) { unsigned Opc = Node->getOpcode(); MVT VT = cast(Node)->getMemoryVT().getSimpleVT(); - RTLIB::Libcall LC = RTLIB::getATOMIC(Opc, VT); + RTLIB::Libcall LC = RTLIB::getSYNC(Opc, VT); assert(LC != RTLIB::UNKNOWN_LIBCALL && "Unexpected atomic op or value type!"); return ExpandChainLibCall(LC, Node, false); Index: lib/CodeGen/TargetLoweringBase.cpp =================================================================== --- lib/CodeGen/TargetLoweringBase.cpp +++ lib/CodeGen/TargetLoweringBase.cpp @@ -405,7 +405,66 @@ Names[RTLIB::SYNC_FETCH_AND_UMIN_4] = "__sync_fetch_and_umin_4"; Names[RTLIB::SYNC_FETCH_AND_UMIN_8] = "__sync_fetch_and_umin_8"; Names[RTLIB::SYNC_FETCH_AND_UMIN_16] = "__sync_fetch_and_umin_16"; - + + Names[RTLIB::ATOMIC_LOAD] = "__atomic_load"; + Names[RTLIB::ATOMIC_LOAD_1] = "__atomic_load_1"; + Names[RTLIB::ATOMIC_LOAD_2] = "__atomic_load_2"; + Names[RTLIB::ATOMIC_LOAD_4] = "__atomic_load_4"; + Names[RTLIB::ATOMIC_LOAD_8] = "__atomic_load_8"; + Names[RTLIB::ATOMIC_LOAD_16] = "__atomic_load_16"; + + Names[RTLIB::ATOMIC_STORE] = "__atomic_store"; + Names[RTLIB::ATOMIC_STORE_1] = "__atomic_store_1"; + Names[RTLIB::ATOMIC_STORE_2] = "__atomic_store_2"; + Names[RTLIB::ATOMIC_STORE_4] = "__atomic_store_4"; + Names[RTLIB::ATOMIC_STORE_8] = "__atomic_store_8"; + Names[RTLIB::ATOMIC_STORE_16] = "__atomic_store_16"; + + Names[RTLIB::ATOMIC_EXCHANGE] = "__atomic_exchange"; + Names[RTLIB::ATOMIC_EXCHANGE_1] = "__atomic_exchange_1"; + Names[RTLIB::ATOMIC_EXCHANGE_2] = "__atomic_exchange_2"; + Names[RTLIB::ATOMIC_EXCHANGE_4] = "__atomic_exchange_4"; + Names[RTLIB::ATOMIC_EXCHANGE_8] = "__atomic_exchange_8"; + Names[RTLIB::ATOMIC_EXCHANGE_16] = "__atomic_exchange_16"; + + Names[RTLIB::ATOMIC_COMPARE_EXCHANGE] = "__atomic_compare_exchange"; + 
Names[RTLIB::ATOMIC_COMPARE_EXCHANGE_1] = "__atomic_compare_exchange_1"; + Names[RTLIB::ATOMIC_COMPARE_EXCHANGE_2] = "__atomic_compare_exchange_2"; + Names[RTLIB::ATOMIC_COMPARE_EXCHANGE_4] = "__atomic_compare_exchange_4"; + Names[RTLIB::ATOMIC_COMPARE_EXCHANGE_8] = "__atomic_compare_exchange_8"; + Names[RTLIB::ATOMIC_COMPARE_EXCHANGE_16] = "__atomic_compare_exchange_16"; + + Names[RTLIB::ATOMIC_FETCH_ADD_1] = "__atomic_fetch_add_1"; + Names[RTLIB::ATOMIC_FETCH_ADD_2] = "__atomic_fetch_add_2"; + Names[RTLIB::ATOMIC_FETCH_ADD_4] = "__atomic_fetch_add_4"; + Names[RTLIB::ATOMIC_FETCH_ADD_8] = "__atomic_fetch_add_8"; + Names[RTLIB::ATOMIC_FETCH_ADD_16] = "__atomic_fetch_add_16"; + Names[RTLIB::ATOMIC_FETCH_SUB_1] = "__atomic_fetch_sub_1"; + Names[RTLIB::ATOMIC_FETCH_SUB_2] = "__atomic_fetch_sub_2"; + Names[RTLIB::ATOMIC_FETCH_SUB_4] = "__atomic_fetch_sub_4"; + Names[RTLIB::ATOMIC_FETCH_SUB_8] = "__atomic_fetch_sub_8"; + Names[RTLIB::ATOMIC_FETCH_SUB_16] = "__atomic_fetch_sub_16"; + Names[RTLIB::ATOMIC_FETCH_AND_1] = "__atomic_fetch_and_1"; + Names[RTLIB::ATOMIC_FETCH_AND_2] = "__atomic_fetch_and_2"; + Names[RTLIB::ATOMIC_FETCH_AND_4] = "__atomic_fetch_and_4"; + Names[RTLIB::ATOMIC_FETCH_AND_8] = "__atomic_fetch_and_8"; + Names[RTLIB::ATOMIC_FETCH_AND_16] = "__atomic_fetch_and_16"; + Names[RTLIB::ATOMIC_FETCH_OR_1] = "__atomic_fetch_or_1"; + Names[RTLIB::ATOMIC_FETCH_OR_2] = "__atomic_fetch_or_2"; + Names[RTLIB::ATOMIC_FETCH_OR_4] = "__atomic_fetch_or_4"; + Names[RTLIB::ATOMIC_FETCH_OR_8] = "__atomic_fetch_or_8"; + Names[RTLIB::ATOMIC_FETCH_OR_16] = "__atomic_fetch_or_16"; + Names[RTLIB::ATOMIC_FETCH_XOR_1] = "__atomic_fetch_xor_1"; + Names[RTLIB::ATOMIC_FETCH_XOR_2] = "__atomic_fetch_xor_2"; + Names[RTLIB::ATOMIC_FETCH_XOR_4] = "__atomic_fetch_xor_4"; + Names[RTLIB::ATOMIC_FETCH_XOR_8] = "__atomic_fetch_xor_8"; + Names[RTLIB::ATOMIC_FETCH_XOR_16] = "__atomic_fetch_xor_16"; + Names[RTLIB::ATOMIC_FETCH_NAND_1] = "__atomic_fetch_nand_1"; + Names[RTLIB::ATOMIC_FETCH_NAND_2] = "__atomic_fetch_nand_2"; + Names[RTLIB::ATOMIC_FETCH_NAND_4] = "__atomic_fetch_nand_4"; + Names[RTLIB::ATOMIC_FETCH_NAND_8] = "__atomic_fetch_nand_8"; + Names[RTLIB::ATOMIC_FETCH_NAND_16] = "__atomic_fetch_nand_16"; + if (TT.getEnvironment() == Triple::GNU) { Names[RTLIB::SINCOS_F32] = "sincosf"; Names[RTLIB::SINCOS_F64] = "sincos"; @@ -667,7 +726,7 @@ return UNKNOWN_LIBCALL; } -RTLIB::Libcall RTLIB::getATOMIC(unsigned Opc, MVT VT) { +RTLIB::Libcall RTLIB::getSYNC(unsigned Opc, MVT VT) { #define OP_TO_LIBCALL(Name, Enum) \ case Name: \ switch (VT.SimpleTy) { \ @@ -774,8 +833,10 @@ PrefLoopAlignment = 0; GatherAllAliasesMaxDepth = 6; MinStackArgumentAlignment = 1; - InsertFencesForAtomic = false; MinimumJumpTableEntries = 4; + // TODO: the default will be switched to 0 in the next commit, along + // with the Target-specific changes necessary. 
+ MaxAtomicSizeSupported = 1024; InitLibcallNames(LibcallRoutineNames, TM.getTargetTriple()); InitCmpLibcallCCs(CmpLibcallCCs); Index: lib/Target/ARM/ARMISelLowering.h =================================================================== --- lib/Target/ARM/ARMISelLowering.h +++ lib/Target/ARM/ARMISelLowering.h @@ -453,6 +453,7 @@ bool lowerInterleavedStore(StoreInst *SI, ShuffleVectorInst *SVI, unsigned Factor) const override; + bool shouldInsertFencesForAtomic(const Instruction *I) const override; TargetLoweringBase::AtomicExpansionKind shouldExpandAtomicLoadInIR(LoadInst *LI) const override; bool shouldExpandAtomicStoreInIR(StoreInst *SI) const override; @@ -486,6 +487,10 @@ /// unsigned ARMPCLabelIndex; + // TODO: remove this, and have shouldInsertFencesForAtomic do the proper + // check. + bool InsertFencesForAtomic; + void addTypeForNEON(MVT VT, MVT PromotedLdStVT, MVT PromotedBitwiseVT); void addDRTypeForNEON(MVT VT); void addQRTypeForNEON(MVT VT); Index: lib/Target/ARM/ARMISelLowering.cpp =================================================================== --- lib/Target/ARM/ARMISelLowering.cpp +++ lib/Target/ARM/ARMISelLowering.cpp @@ -840,6 +840,7 @@ // the default expansion. If we are targeting a single threaded system, // then set them all for expand so we can lower them later into their // non-atomic form. + InsertFencesForAtomic = false; if (TM.Options.ThreadModel == ThreadModel::Single) setOperationAction(ISD::ATOMIC_FENCE, MVT::Other, Expand); else if (Subtarget->hasAnyDataBarrier() && (!Subtarget->isThumb() || @@ -852,7 +853,7 @@ // if they can be combined with nearby atomic loads and stores. if (!Subtarget->hasV8Ops()) { // Automatically insert fences (dmb ish) around ATOMIC_SWAP etc. - setInsertFencesForAtomic(true); + InsertFencesForAtomic = true; } } else { // If there's anything we can use as a barrier, go through custom lowering @@ -11997,9 +11998,6 @@ Instruction* ARMTargetLowering::emitLeadingFence(IRBuilder<> &Builder, AtomicOrdering Ord, bool IsStore, bool IsLoad) const { - if (!getInsertFencesForAtomic()) - return nullptr; - switch (Ord) { case NotAtomic: case Unordered: @@ -12025,9 +12023,6 @@ Instruction* ARMTargetLowering::emitTrailingFence(IRBuilder<> &Builder, AtomicOrdering Ord, bool IsStore, bool IsLoad) const { - if (!getInsertFencesForAtomic()) - return nullptr; - switch (Ord) { case NotAtomic: case Unordered: @@ -12081,6 +12076,11 @@ return true; } +bool ARMTargetLowering::shouldInsertFencesForAtomic( + const Instruction *I) const { + return InsertFencesForAtomic; +} + // This has so far only been implemented for MachO. 
bool ARMTargetLowering::useLoadStackGuardNode() const { return Subtarget->isTargetMachO(); Index: lib/Target/Hexagon/HexagonISelLowering.cpp =================================================================== --- lib/Target/Hexagon/HexagonISelLowering.cpp +++ lib/Target/Hexagon/HexagonISelLowering.cpp @@ -1724,7 +1724,6 @@ setPrefLoopAlignment(4); setPrefFunctionAlignment(4); setMinFunctionAlignment(2); - setInsertFencesForAtomic(false); setStackPointerRegisterToSaveRestore(HRI.getStackRegister()); if (EnableHexSDNodeSched) Index: lib/Target/Mips/MipsISelLowering.h =================================================================== --- lib/Target/Mips/MipsISelLowering.h +++ lib/Target/Mips/MipsISelLowering.h @@ -561,6 +561,10 @@ unsigned getJumpTableEncoding() const override; bool useSoftFloat() const override; + bool shouldInsertFencesForAtomic(const Instruction *I) const override { + return true; + } + /// Emit a sign-extension using sll/sra, seb, or seh appropriately. MachineBasicBlock *emitSignExtendToI32InReg(MachineInstr *MI, MachineBasicBlock *BB, Index: lib/Target/Mips/MipsISelLowering.cpp =================================================================== --- lib/Target/Mips/MipsISelLowering.cpp +++ lib/Target/Mips/MipsISelLowering.cpp @@ -396,7 +396,6 @@ setOperationAction(ISD::ATOMIC_STORE, MVT::i64, Expand); } - setInsertFencesForAtomic(true); if (!Subtarget.hasMips32r2()) { setOperationAction(ISD::SIGN_EXTEND_INREG, MVT::i8, Expand); Index: lib/Target/PowerPC/PPCISelLowering.h =================================================================== --- lib/Target/PowerPC/PPCISelLowering.h +++ lib/Target/PowerPC/PPCISelLowering.h @@ -508,6 +508,10 @@ unsigned getPrefLoopAlignment(MachineLoop *ML) const override; + bool shouldInsertFencesForAtomic(const Instruction *I) const override { + return true; + } + Instruction* emitLeadingFence(IRBuilder<> &Builder, AtomicOrdering Ord, bool IsStore, bool IsLoad) const override; Instruction* emitTrailingFence(IRBuilder<> &Builder, AtomicOrdering Ord, Index: lib/Target/PowerPC/PPCISelLowering.cpp =================================================================== --- lib/Target/PowerPC/PPCISelLowering.cpp +++ lib/Target/PowerPC/PPCISelLowering.cpp @@ -916,7 +916,6 @@ break; } - setInsertFencesForAtomic(true); if (Subtarget.enableMachineScheduler()) setSchedulingPreference(Sched::Source); Index: lib/Target/Sparc/SparcISelLowering.h =================================================================== --- lib/Target/Sparc/SparcISelLowering.h +++ lib/Target/Sparc/SparcISelLowering.h @@ -180,6 +180,13 @@ return VT != MVT::f128; } + bool shouldInsertFencesForAtomic(const Instruction *I) const override { + // FIXME: We insert fences for each atomics and generate + // sub-optimal code for PSO/TSO. (Approximately nobody uses any + // mode but TSO, which makes this even more silly) + return true; + } + void ReplaceNodeResults(SDNode *N, SmallVectorImpl& Results, SelectionDAG &DAG) const override; Index: lib/Target/Sparc/SparcISelLowering.cpp =================================================================== --- lib/Target/Sparc/SparcISelLowering.cpp +++ lib/Target/Sparc/SparcISelLowering.cpp @@ -1603,10 +1603,13 @@ } // ATOMICs. - // FIXME: We insert fences for each atomics and generate sub-optimal code - // for PSO/TSO. Also, implement other atomicrmw operations. - - setInsertFencesForAtomic(true); + // Atomics are only supported on Sparcv9. (32bit atomics are also + // supported by the Leon sparcv8 variant, but we don't support that + // yet.) 
+ if (Subtarget->isV9()) + setMaxAtomicSizeSupported(64); + else + setMaxAtomicSizeSupported(0); setOperationAction(ISD::ATOMIC_SWAP, MVT::i32, Legal); setOperationAction(ISD::ATOMIC_CMP_SWAP, MVT::i32, Index: lib/Target/XCore/XCoreISelLowering.h =================================================================== --- lib/Target/XCore/XCoreISelLowering.h +++ lib/Target/XCore/XCoreISelLowering.h @@ -229,6 +229,9 @@ bool isVarArg, const SmallVectorImpl &ArgsFlags, LLVMContext &Context) const override; + bool shouldInsertFencesForAtomic(const Instruction *I) const override { + return true; + } }; } Index: lib/Target/XCore/XCoreISelLowering.cpp =================================================================== --- lib/Target/XCore/XCoreISelLowering.cpp +++ lib/Target/XCore/XCoreISelLowering.cpp @@ -156,7 +156,6 @@ // Atomic operations // We request a fence for ATOMIC_* instructions, to reduce them to Monotonic. // As we are always Sequential Consistent, an ATOMIC_FENCE becomes a no OP. - setInsertFencesForAtomic(true); setOperationAction(ISD::ATOMIC_FENCE, MVT::Other, Custom); setOperationAction(ISD::ATOMIC_LOAD, MVT::i32, Custom); setOperationAction(ISD::ATOMIC_STORE, MVT::i32, Custom); Index: test/Transforms/AtomicExpand/SPARC/libcalls.ll =================================================================== --- /dev/null +++ test/Transforms/AtomicExpand/SPARC/libcalls.ll @@ -0,0 +1,217 @@ +; RUN: opt -S %s -atomic-expand | FileCheck %s + +;;; NOTE: this test is actually target-independent -- any target which +;;; doesn't support inline atomics can be used. (E.g. X86 i386 would +;;; work, if LLVM is properly taught about what it's missing vs i586.) + +;target datalayout = "e-m:e-p:32:32-f64:32:64-f80:32-n8:16:32-S128" +;target triple = "i386-unknown-unknown" +target datalayout = "e-m:e-p:32:32-i64:64-f128:64-n32-S64" +target triple = "sparc-unknown-unknown" + +;; First, check the sized calls. Except for cmpxchg, these are fairly +;; straightforward. 
+ +; CHECK-LABEL: @test_load_i16( +; CHECK: %1 = bitcast i16* %arg to i8* +; CHECK: %2 = call i16 @__atomic_load_2(i8* %1, i32 5) +; CHECK: ret i16 %2 +define i16 @test_load_i16(i16* %arg) { + %ret = load atomic i16, i16* %arg seq_cst, align 4 + ret i16 %ret +} + +; CHECK-LABEL: @test_store_i16( +; CHECK: %1 = bitcast i16* %arg to i8* +; CHECK: call void @__atomic_store_2(i8* %1, i16 %val, i32 5) +; CHECK: ret void +define void @test_store_i16(i16* %arg, i16 %val) { + store atomic i16 %val, i16* %arg seq_cst, align 4 + ret void +} + +; CHECK-LABEL: @test_exchange_i16( +; CHECK: %1 = bitcast i16* %arg to i8* +; CHECK: %2 = call i16 @__atomic_exchange_2(i8* %1, i16 %val, i32 5) +; CHECK: ret i16 %2 +define i16 @test_exchange_i16(i16* %arg, i16 %val) { + %ret = atomicrmw xchg i16* %arg, i16 %val seq_cst + ret i16 %ret +} + +; CHECK-LABEL: @test_cmpxchg_i16( +; CHECK: %1 = bitcast i16* %arg to i8* +; CHECK: %2 = alloca i16, align 2 +; CHECK: %3 = bitcast i16* %2 to i8* +; CHECK: call void @llvm.lifetime.start(i64 2, i8* %3) +; CHECK: store i16 %old, i16* %2, align 2 +; CHECK: %4 = call zeroext i1 @__atomic_compare_exchange_2(i8* %1, i8* %3, i16 %new, i32 5, i32 0) +; CHECK: %5 = load i16, i16* %2, align 2 +; CHECK: call void @llvm.lifetime.end(i64 2, i8* %3) +; CHECK: %6 = insertvalue { i16, i1 } undef, i16 %5, 0 +; CHECK: %7 = insertvalue { i16, i1 } %6, i1 %4, 1 +; CHECK: %ret = extractvalue { i16, i1 } %7, 0 +; CHECK: ret i16 %ret +define i16 @test_cmpxchg_i16(i16* %arg, i16 %old, i16 %new) { + %ret_succ = cmpxchg i16* %arg, i16 %old, i16 %new seq_cst monotonic + %ret = extractvalue { i16, i1 } %ret_succ, 0 + ret i16 %ret +} + +; CHECK-LABEL: @test_add_i16( +; CHECK: %1 = bitcast i16* %arg to i8* +; CHECK: %2 = call i16 @__atomic_fetch_add_2(i8* %1, i16 %val, i32 5) +; CHECK: ret i16 %2 +define i16 @test_add_i16(i16* %arg, i16 %val) { + %ret = atomicrmw add i16* %arg, i16 %val seq_cst + ret i16 %ret +} + + +;; Now, check the output for the unsized libcalls. i128 is used for +;; these tests because the "16" suffixed functions aren't available on +;; 32-bit i386. 
+ +; CHECK-LABEL: @test_load_i128( +; CHECK: %1 = bitcast i128* %arg to i8* +; CHECK: %2 = alloca i128, align 16 +; CHECK: %3 = bitcast i128* %2 to i8* +; CHECK: call void @llvm.lifetime.start(i64 16, i8* %3) +; CHECK: call void @__atomic_load(i32 16, i8* %1, i8* %3, i32 5) +; CHECK: %4 = load i128, i128* %2, align 16 +; CHECK: call void @llvm.lifetime.end(i64 16, i8* %3) +; CHECK: ret i128 %4 +define i128 @test_load_i128(i128* %arg) { + %ret = load atomic i128, i128* %arg seq_cst, align 16 + ret i128 %ret +} + +; CHECK-LABEL @test_store_i128( +; CHECK: %1 = bitcast i128* %arg to i8* +; CHECK: %2 = alloca i128, align 16 +; CHECK: %3 = bitcast i128* %2 to i8* +; CHECK: call void @llvm.lifetime.start(i64 16, i8* %3) +; CHECK: store i128 %val, i128* %2, align 16 +; CHECK: call void @__atomic_store(i32 16, i8* %1, i8* %3, i32 5) +; CHECK: call void @llvm.lifetime.end(i64 16, i8* %3) +; CHECK: ret void +define void @test_store_i128(i128* %arg, i128 %val) { + store atomic i128 %val, i128* %arg seq_cst, align 16 + ret void +} + +; CHECK-LABEL: @test_exchange_i128( +; CHECK: %1 = bitcast i128* %arg to i8* +; CHECK: %2 = alloca i128, align 16 +; CHECK: %3 = bitcast i128* %2 to i8* +; CHECK: call void @llvm.lifetime.start(i64 16, i8* %3) +; CHECK: store i128 %val, i128* %2, align 16 +; CHECK: %4 = alloca i128, align 16 +; CHECK: %5 = bitcast i128* %4 to i8* +; CHECK: call void @llvm.lifetime.start(i64 16, i8* %5) +; CHECK: call void @__atomic_exchange(i32 16, i8* %1, i8* %3, i8* %5, i32 5) +; CHECK: call void @llvm.lifetime.end(i64 16, i8* %3) +; CHECK: %6 = load i128, i128* %4, align 16 +; CHECK: call void @llvm.lifetime.end(i64 16, i8* %5) +; CHECK: ret i128 %6 +define i128 @test_exchange_i128(i128* %arg, i128 %val) { + %ret = atomicrmw xchg i128* %arg, i128 %val seq_cst + ret i128 %ret +} + +; CHECK-LABEL: @test_cmpxchg_i128( +; CHECK: %1 = bitcast i128* %arg to i8* +; CHECK: %2 = alloca i128, align 16 +; CHECK: %3 = bitcast i128* %2 to i8* +; CHECK: call void @llvm.lifetime.start(i64 16, i8* %3) +; CHECK: store i128 %old, i128* %2, align 16 +; CHECK: %4 = alloca i128, align 16 +; CHECK: %5 = bitcast i128* %4 to i8* +; CHECK: call void @llvm.lifetime.start(i64 16, i8* %5) +; CHECK: store i128 %new, i128* %4, align 16 +; CHECK: %6 = call zeroext i1 @__atomic_compare_exchange(i32 16, i8* %1, i8* %3, i8* %5, i32 5, i32 0) +; CHECK: call void @llvm.lifetime.end(i64 16, i8* %5) +; CHECK: %7 = load i128, i128* %2, align 16 +; CHECK: call void @llvm.lifetime.end(i64 16, i8* %3) +; CHECK: %8 = insertvalue { i128, i1 } undef, i128 %7, 0 +; CHECK: %9 = insertvalue { i128, i1 } %8, i1 %6, 1 +; CHECK: %ret = extractvalue { i128, i1 } %9, 0 +; CHECK: ret i128 %ret +define i128 @test_cmpxchg_i128(i128* %arg, i128 %old, i128 %new) { + %ret_succ = cmpxchg i128* %arg, i128 %old, i128 %new seq_cst monotonic + %ret = extractvalue { i128, i1 } %ret_succ, 0 + ret i128 %ret +} + +; This one is a verbose expansion, as there is no generic +; __atomic_fetch_add function, so it needs to expand to a cmpxchg +; loop, which then itself expands into a libcall. 
+ +; CHECK-LABEL: @test_add_i128( +; CHECK: %1 = alloca i128, align 16 +; CHECK: %2 = alloca i128, align 16 +; CHECK: %3 = load i128, i128* %arg, align 16 +; CHECK: br label %atomicrmw.start +; CHECK:atomicrmw.start: +; CHECK: %loaded = phi i128 [ %3, %0 ], [ %newloaded, %atomicrmw.start ] +; CHECK: %new = add i128 %loaded, %val +; CHECK: %4 = bitcast i128* %arg to i8* +; CHECK: %5 = bitcast i128* %1 to i8* +; CHECK: call void @llvm.lifetime.start(i64 16, i8* %5) +; CHECK: store i128 %loaded, i128* %1, align 16 +; CHECK: %6 = bitcast i128* %2 to i8* +; CHECK: call void @llvm.lifetime.start(i64 16, i8* %6) +; CHECK: store i128 %new, i128* %2, align 16 +; CHECK: %7 = call zeroext i1 @__atomic_compare_exchange(i32 16, i8* %4, i8* %5, i8* %6, i32 5, i32 5) +; CHECK: call void @llvm.lifetime.end(i64 16, i8* %6) +; CHECK: %8 = load i128, i128* %1, align 16 +; CHECK: call void @llvm.lifetime.end(i64 16, i8* %5) +; CHECK: %9 = insertvalue { i128, i1 } undef, i128 %8, 0 +; CHECK: %10 = insertvalue { i128, i1 } %9, i1 %7, 1 +; CHECK: %success = extractvalue { i128, i1 } %10, 1 +; CHECK: %newloaded = extractvalue { i128, i1 } %10, 0 +; CHECK: br i1 %success, label %atomicrmw.end, label %atomicrmw.start +; CHECK:atomicrmw.end: +; CHECK: ret i128 %newloaded +define i128 @test_add_i128(i128* %arg, i128 %val) { + %ret = atomicrmw add i128* %arg, i128 %val seq_cst + ret i128 %ret +} + +;; Ensure that non-integer types get bitcast correctly on the way in and out of a libcall: + +; CHECK-LABEL: @test_load_double( +; CHECK: %1 = bitcast double* %arg to i8* +; CHECK: %2 = call i64 @__atomic_load_8(i8* %1, i32 5) +; CHECK: %3 = bitcast i64 %2 to double +; CHECK: ret double %3 +define double @test_load_double(double* %arg, double %val) { + %1 = load atomic double, double* %arg seq_cst, align 16 + ret double %1 +} + +; CHECK-LABEL: @test_store_double( +; CHECK: %1 = bitcast double* %arg to i8* +; CHECK: %2 = bitcast double %val to i64 +; CHECK: call void @__atomic_store_8(i8* %1, i64 %2, i32 5) +; CHECK: ret void +define void @test_store_double(double* %arg, double %val) { + store atomic double %val, double* %arg seq_cst, align 16 + ret void +} + +;; ...and for a non-integer type of large size too. + +; CHECK-LABEL: @test_store_fp128 +; CHECK: %1 = bitcast fp128* %arg to i8* +; CHECK: %2 = alloca fp128, align 16 +; CHECK: %3 = bitcast fp128* %2 to i8* +; CHECK: call void @llvm.lifetime.start(i64 16, i8* %3) +; CHECK: store fp128 %val, fp128* %2, align 16 +; CHECK: call void @__atomic_store(i32 16, i8* %1, i8* %3, i32 5) +; CHECK: call void @llvm.lifetime.end(i64 16, i8* %3) +; CHECK: ret void +define void @test_store_fp128(fp128* %arg, fp128 %val) { + store atomic fp128 %val, fp128* %arg seq_cst, align 16 + ret void +}