Index: docs/LangRef.rst
===================================================================
--- docs/LangRef.rst
+++ docs/LangRef.rst
@@ -13553,62 +13553,70 @@
 These intrinsics are similar to the standard library memory intrinsics except
 that they perform memory transfer as a sequence of atomic memory accesses.

-.. _int_memcpy_element_atomic:
+.. _int_memcpy_element_unordered_atomic:

-'``llvm.memcpy.element.atomic``' Intrinsic
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+'``llvm.memcpy.element.unordered.atomic``' Intrinsic
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 Syntax:
 """""""

-This is an overloaded intrinsic. You can use ``llvm.memcpy.element.atomic`` on
+This is an overloaded intrinsic. You can use ``llvm.memcpy.element.unordered.atomic`` on
 any integer bit width and for different address spaces. Not all targets
 support all bit widths however.

 ::

-      declare void @llvm.memcpy.element.atomic.p0i8.p0i8(i8* <dest>, i8* <src>,
-                                                          i64 <num_elements>, i32 <element_size>)
+      declare void @llvm.memcpy.element.unordered.atomic.p0i8.p0i8.i32(i8* <dest>, i8* <src>, i32 <len>,
+                                                                       i32 <alignment>, i1 <isvolatile>,
+                                                                       i1 <dest_unordered>, i1 <src_unordered>,
+                                                                       i8 <element_size>)
+      declare void @llvm.memcpy.element.unordered.atomic.p0i8.p0i8.i64(i8* <dest>, i8* <src>, i64 <len>,
+                                                                       i32 <alignment>, i1 <isvolatile>,
+                                                                       i1 <dest_unordered>, i1 <src_unordered>,
+                                                                       i8 <element_size>)

 Overview:
 """""""""

-The '``llvm.memcpy.element.atomic.*``' intrinsic performs copy of a block of
-memory from the source location to the destination location as a sequence of
-unordered atomic memory accesses where each access is a multiple of
-``element_size`` bytes wide and aligned at an element size boundary. For example
-each element is accessed atomically in source and destination buffers.
+The '``llvm.memcpy.element.unordered.atomic.*``' intrinsic is a specialization of the '``llvm.memcpy.*``'
+intrinsic. It differs in that ``dest`` and ``src`` are treated as arrays whose elements are
+exactly ``element_size`` bytes, and the copy between the buffers is done with
+:ref:`unordered atomic <ordering>` load/store operations that are a positive integer multiple
+of the ``element_size`` in size.

 Arguments:
 """"""""""

-The first argument is a pointer to the destination, the second is a
-pointer to the source. The third argument is an integer argument
-specifying the number of elements to copy, the fourth argument is size of
-the single element in bytes.
+The first five arguments are the same as in the :ref:`@llvm.memcpy <int_memcpy>` intrinsic,
+with the added constraint that ``len`` must be a positive integer multiple of the ``element_size``.

-``element_size`` should be a power of two, greater than zero and less than
-a target-specific atomic access size limit.
+``dest_unordered`` is ``true`` if and only if stores to the destination buffer must be unordered
+atomic stores.

-For each of the input pointers ``align`` parameter attribute must be specified.
-It must be a power of two and greater than or equal to the ``element_size``.
-Caller guarantees that both the source and destination pointers are aligned to
-that boundary.
+``src_unordered`` is ``true`` if and only if loads from the source buffer must be unordered atomic
+loads.
+
+``element_size`` must be a compile-time constant positive power of two no greater than a
+target-specific atomic access size limit.
+
+For each of the input pointers, an ``align`` parameter attribute must be specified. It must be a
+power of two and greater than or equal to the ``element_size``. The caller guarantees that both the
+source and destination pointers are aligned to that boundary.

 Semantics:
 """"""""""

-The '``llvm.memcpy.element.atomic.*``' intrinsic copies
-'``num_elements`` * ``element_size``' bytes of memory from the source location to
-the destination location. These locations are not allowed to overlap. Memory copy
-is performed as a sequence of unordered atomic memory accesses where each access
-is guaranteed to be a multiple of ``element_size`` bytes wide and aligned at an
-element size boundary.
+The '``llvm.memcpy.element.unordered.atomic.*``' intrinsic copies ``len`` bytes of memory from
+the source location to the destination location. These locations are not allowed to overlap.
+The memory copy is performed as a sequence of load/store operations where each access is
+guaranteed to be a multiple of ``element_size`` bytes wide and aligned at an ``element_size``
+boundary.

 The order of the copy is unspecified. The same value may be read from the source
 buffer many times, but only one write is issued to the destination buffer per
-element. It is well defined to have concurrent reads and writes to both source
-and destination provided those reads and writes are at least unordered atomic.
+element. It is well defined to have concurrent reads and writes to both source and destination
+provided those reads and writes are unordered atomic when so specified.

 This intrinsic does not provide any additional ordering guarantees over those
 provided by a set of unordered loads from the source location and stores to the
@@ -13617,8 +13625,8 @@
 Lowering:
 """""""""

-In the most general case call to the '``llvm.memcpy.element.atomic.*``' is lowered
-to a call to the symbol ``__llvm_memcpy_element_atomic_*``. Where '*' is replaced
+In the most general case, a call to the '``llvm.memcpy.element.unordered.atomic.*``' intrinsic is
+lowered to a call to the symbol ``__llvm_memcpy_element_unordered_atomic_*``, where '*' is replaced
 with an actual element size.

-Optimizer is allowed to inline memory copy when it's profitable to do so.
+The optimizer is allowed to inline the memory copy when it's profitable to do so.
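As a companion to the LangRef text above, the following is a minimal sketch of how a frontend or
pass might emit the new eight-argument intrinsic with IRBuilder. Only the intrinsic name and
operand list come from this patch; the helper name, the chosen constants, and the exact attribute
handling are illustrative assumptions. ::

    // Illustrative only: emit @llvm.memcpy.element.unordered.atomic for a copy
    // of Len bytes using 4-byte elements, with both sides unordered-atomic.
    #include "llvm/IR/IRBuilder.h"
    #include "llvm/IR/Intrinsics.h"
    #include "llvm/IR/Module.h"
    using namespace llvm;

    static CallInst *emitElementUnorderedAtomicMemCpy(IRBuilder<> &B, Module &M,
                                                      Value *Dest, Value *Src,
                                                      Value *Len) {
      // The intrinsic is overloaded on the two pointer types and the length type.
      Type *Tys[] = {Dest->getType(), Src->getType(), Len->getType()};
      Value *Fn = Intrinsic::getDeclaration(
          &M, Intrinsic::memcpy_element_unordered_atomic, Tys);
      Value *Args[] = {Dest,
                       Src,
                       Len,
                       B.getInt32(4),    // alignment
                       B.getInt1(false), // isvolatile
                       B.getInt1(true),  // dest_unordered
                       B.getInt1(true),  // src_unordered
                       B.getInt8(4)};    // element_size
      // Per the Arguments section, the caller must also attach `align`
      // parameter attributes to the dest and src pointer operands.
      return B.CreateCall(Fn, Args);
    }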
Index: include/llvm/CodeGen/RuntimeLibcalls.h
===================================================================
--- include/llvm/CodeGen/RuntimeLibcalls.h
+++ include/llvm/CodeGen/RuntimeLibcalls.h
@@ -333,12 +333,12 @@
     MEMSET,
     MEMMOVE,

-    // ELEMENT-WISE ATOMIC MEMORY
-    MEMCPY_ELEMENT_ATOMIC_1,
-    MEMCPY_ELEMENT_ATOMIC_2,
-    MEMCPY_ELEMENT_ATOMIC_4,
-    MEMCPY_ELEMENT_ATOMIC_8,
-    MEMCPY_ELEMENT_ATOMIC_16,
+    // ELEMENT-WISE UNORDERED-ATOMIC MEMORY of different element sizes
+    MEMCPY_ELEMENT_UNORDERED_ATOMIC_1,
+    MEMCPY_ELEMENT_UNORDERED_ATOMIC_2,
+    MEMCPY_ELEMENT_UNORDERED_ATOMIC_4,
+    MEMCPY_ELEMENT_UNORDERED_ATOMIC_8,
+    MEMCPY_ELEMENT_UNORDERED_ATOMIC_16,

     // EXCEPTION HANDLING
     UNWIND_RESUME,
@@ -511,9 +511,9 @@
   /// UNKNOWN_LIBCALL if there is none.
   Libcall getSYNC(unsigned Opc, MVT VT);

-  /// getMEMCPY_ELEMENT_ATOMIC - Return MEMCPY_ELEMENT_ATOMIC_* value for the
+  /// getMEMCPY_ELEMENT_UNORDERED_ATOMIC - Return MEMCPY_ELEMENT_UNORDERED_ATOMIC_* value for the
   /// given element size or UNKNOW_LIBCALL if there is none.
-  Libcall getMEMCPY_ELEMENT_ATOMIC(uint64_t ElementSize);
+  Libcall getMEMCPY_ELEMENT_UNORDERED_ATOMIC(uint64_t ElementSize);
 }
 }
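A brief usage sketch for the renamed libcall accessor declared above. The wrapper function, the
header paths, and the availability of a ``TargetLowering`` reference are assumptions for
illustration; only ``RTLIB::getMEMCPY_ELEMENT_UNORDERED_ATOMIC`` and the symbol names come from
this patch. ::

    #include "llvm/CodeGen/RuntimeLibcalls.h"
    #include "llvm/Support/ErrorHandling.h"
    #include "llvm/Target/TargetLowering.h"
    using namespace llvm;

    // Map a constant element size to the runtime symbol the lowering will call.
    static const char *getElementUnorderedAtomicMemcpySymbol(
        const TargetLowering &TLI, uint64_t ElementSize) {
      RTLIB::Libcall LC = RTLIB::getMEMCPY_ELEMENT_UNORDERED_ATOMIC(ElementSize);
      if (LC == RTLIB::UNKNOWN_LIBCALL)
        report_fatal_error("Unsupported element size");
      // For ElementSize == 4 this is "__llvm_memcpy_element_unordered_atomic_4".
      return TLI.getLibcallName(LC);
    }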
Index: include/llvm/IR/IntrinsicInst.h
===================================================================
--- include/llvm/IR/IntrinsicInst.h
+++ include/llvm/IR/IntrinsicInst.h
@@ -192,25 +192,122 @@
   };

   /// This class represents atomic memcpy intrinsic
-  /// TODO: Integrate this class into MemIntrinsic hierarchy.
-  class ElementAtomicMemCpyInst : public IntrinsicInst {
+  /// TODO: Integrate this class into MemIntrinsic hierarchy; for now this is
+  /// C&P of all methods from that hierarchy
+  class ElementUnorderedAtomicMemCpyInst : public IntrinsicInst {
+  private:
+    constexpr static int ARG_DEST = 0;
+    constexpr static int ARG_SRC = 1;
+    constexpr static int ARG_LENGTH = 2;
+    constexpr static int ARG_ALIGN = 3;
+    constexpr static int ARG_VOLATILE = 4;
+    constexpr static int ARG_DEST_UNORDERED = 5;
+    constexpr static int ARG_SRC_UNORDERED = 6;
+    constexpr static int ARG_ELEMENTSIZE = 7;
+
   public:
-    Value *getRawDest() const { return getArgOperand(0); }
-    Value *getRawSource() const { return getArgOperand(1); }
+    Value *getRawDest() const {
+      return const_cast<Value *>(getArgOperand(ARG_DEST));
+    }
+    const Use &getRawDestUse() const { return getArgOperandUse(ARG_DEST); }
+    Use &getRawDestUse() { return getArgOperandUse(ARG_DEST); }
+
+    /// Return the arguments to the instruction.
+    Value *getRawSource() const {
+      return const_cast<Value *>(getArgOperand(ARG_SRC));
+    }
+    const Use &getRawSourceUse() const { return getArgOperandUse(ARG_SRC); }
+    Use &getRawSourceUse() { return getArgOperandUse(ARG_SRC); }
+
+    Value *getLength() const {
+      return const_cast<Value *>(getArgOperand(ARG_LENGTH));
+    }
+    const Use &getLengthUse() const { return getArgOperandUse(ARG_LENGTH); }
+    Use &getLengthUse() { return getArgOperandUse(ARG_LENGTH); }
+
+    ConstantInt *getAlignmentCst() const {
+      return cast<ConstantInt>(const_cast<Value *>(getArgOperand(ARG_ALIGN)));
+    }

-    Value *getNumElements() const { return getArgOperand(2); }
-    void setNumElements(Value *V) { setArgOperand(2, V); }
+    unsigned getAlignment() const { return getAlignmentCst()->getZExtValue(); }
+
+    Type *getAlignmentType() const {
+      return getArgOperand(ARG_ALIGN)->getType();
+    }

-    uint64_t getSrcAlignment() const { return getParamAlignment(0); }
-    uint64_t getDstAlignment() const { return getParamAlignment(1); }
+    ConstantInt *getVolatileCst() const {
+      return cast<ConstantInt>(
+          const_cast<Value *>(getArgOperand(ARG_VOLATILE)));
+    }

-    uint64_t getElementSizeInBytes() const {
-      Value *Arg = getArgOperand(3);
+    bool isVolatile() const { return !getVolatileCst()->isZero(); }
+
+    uint8_t getDestUnordered() const {
+      Value *Arg = getArgOperand(ARG_DEST_UNORDERED);
+      return uint8_t(cast<ConstantInt>(Arg)->getZExtValue());
+    }
+
+    uint8_t getSrcUnordered() const {
+      Value *Arg = getArgOperand(ARG_SRC_UNORDERED);
+      return uint8_t(cast<ConstantInt>(Arg)->getZExtValue());
+    }
+
+    uint8_t getElementSizeInBytes() const {
+      Value *Arg = getArgOperand(ARG_ELEMENTSIZE);
       return cast<ConstantInt>(Arg)->getZExtValue();
     }

+    /// This is just like getRawDest, but it strips off any cast
+    /// instructions that feed it, giving the original input. The returned
+    /// value is guaranteed to be a pointer.
+    Value *getDest() const { return getRawDest()->stripPointerCasts(); }
+
+    /// This is just like getRawSource, but it strips off any cast
+    /// instructions that feed it, giving the original input. The returned
+    /// value is guaranteed to be a pointer.
+    Value *getSource() const { return getRawSource()->stripPointerCasts(); }
+
+    unsigned getDestAddressSpace() const {
+      return cast<PointerType>(getRawDest()->getType())->getAddressSpace();
+    }
+
+    unsigned getSourceAddressSpace() const {
+      return cast<PointerType>(getRawSource()->getType())->getAddressSpace();
+    }
+
+    /// Set the specified arguments of the instruction.
+    void setDest(Value *Ptr) {
+      assert(getRawDest()->getType() == Ptr->getType() &&
+             "setDest called with pointer of wrong type!");
+      setArgOperand(ARG_DEST, Ptr);
+    }
+
+    void setSource(Value *Ptr) {
+      assert(getRawSource()->getType() == Ptr->getType() &&
+             "setSource called with pointer of wrong type!");
+      setArgOperand(ARG_SRC, Ptr);
+    }
+
+    void setLength(Value *L) {
+      assert(getLength()->getType() == L->getType() &&
+             "setLength called with value of wrong type!");
+      setArgOperand(ARG_LENGTH, L);
+    }
+
+    void setAlignment(Constant *A) { setArgOperand(ARG_ALIGN, A); }
+
+    void setVolatile(Constant *V) { setArgOperand(ARG_VOLATILE, V); }
+
+    void setDestUnordered(Constant *V) { setArgOperand(ARG_DEST_UNORDERED, V); }
+
+    void setSrcUnordered(Constant *V) { setArgOperand(ARG_SRC_UNORDERED, V); }
+
+    void setElementSizeInBytes(Constant *V) {
+      setArgOperand(ARG_ELEMENTSIZE, V);
+    }
+
     static inline bool classof(const IntrinsicInst *I) {
-      return I->getIntrinsicID() == Intrinsic::memcpy_element_atomic;
+      return I->getIntrinsicID() == Intrinsic::memcpy_element_unordered_atomic;
     }
     static inline bool classof(const Value *V) {
       return isa<IntrinsicInst>(V) && classof(cast<IntrinsicInst>(V));
Index: include/llvm/IR/Intrinsics.td
===================================================================
--- include/llvm/IR/Intrinsics.td
+++ include/llvm/IR/Intrinsics.td
@@ -806,11 +806,18 @@
 //===------ Memory intrinsics with element-wise atomicity guarantees ------===//
 //
-def int_memcpy_element_atomic : Intrinsic<[],
-                                          [llvm_anyptr_ty, llvm_anyptr_ty,
-                                           llvm_i64_ty, llvm_i32_ty],
-                                          [IntrArgMemOnly, NoCapture<0>, NoCapture<1>,
-                                           WriteOnly<0>, ReadOnly<1>]>;
+// llvm.memcpy.element.unordered.atomic(dest, src, length, alignment, volatile,
+//                                      dest_unordered, src_unordered, elementsize)
+def int_memcpy_element_unordered_atomic
+    : Intrinsic<[],
+                [
+                  llvm_anyptr_ty, llvm_anyptr_ty, llvm_anyint_ty, llvm_i32_ty,
+                  llvm_i1_ty, llvm_i1_ty, llvm_i1_ty, llvm_i8_ty
+                ],
+                [
+                  IntrArgMemOnly, NoCapture<0>, NoCapture<1>, WriteOnly<0>,
+                  ReadOnly<1>
+                ]>;

 //===------------------------ Reduction Intrinsics ------------------------===//
 //
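For readers of the accessor class above, here is a small usage sketch. The surrounding pass
boilerplate and the decision made in the branch are assumptions; only the
``ElementUnorderedAtomicMemCpyInst`` API itself comes from this patch. ::

    #include "llvm/IR/IntrinsicInst.h"
    using namespace llvm;

    // Inspect an instruction through the typed accessors instead of raw
    // getArgOperand() indices.
    static void inspectElementAtomicMemCpy(Instruction &I) {
      if (auto *AMI = dyn_cast<ElementUnorderedAtomicMemCpyInst>(&I)) {
        Value *Dest = AMI->getDest();       // destination, pointer casts stripped
        Value *Src = AMI->getSource();      // source, pointer casts stripped
        Value *Len = AMI->getLength();      // byte count, may be non-constant
        uint8_t ElemSize = AMI->getElementSizeInBytes();
        if (AMI->getDestUnordered() || AMI->getSrcUnordered()) {
          // Any rewrite must keep accesses a multiple of ElemSize and
          // element-aligned, per the LangRef text above.
        }
        (void)Dest; (void)Src; (void)Len; (void)ElemSize;
      }
    }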
Index: lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
===================================================================
--- lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
+++ lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
@@ -4867,11 +4867,15 @@
     updateDAGForMaybeTailCall(MM);
     return nullptr;
   }
-  case Intrinsic::memcpy_element_atomic: {
+  case Intrinsic::memcpy_element_unordered_atomic: {
     SDValue Dst = getValue(I.getArgOperand(0));
     SDValue Src = getValue(I.getArgOperand(1));
-    SDValue NumElements = getValue(I.getArgOperand(2));
-    SDValue ElementSize = getValue(I.getArgOperand(3));
+    SDValue Length = getValue(I.getArgOperand(2));
+    SDValue Alignment = getValue(I.getArgOperand(3));
+    // Note: arg 4 is isvolatile, which is unused for this intrinsic
+    SDValue DestUnordered = getValue(I.getArgOperand(5));
+    SDValue SrcUnordered = getValue(I.getArgOperand(6));
+    // SDValue ElementSize = getValue(I.getArgOperand(7));

     // Emit a library call.
     TargetLowering::ArgListTy Args;
@@ -4884,17 +4888,25 @@
     Args.push_back(Entry);

     Entry.Ty = I.getArgOperand(2)->getType();
-    Entry.Node = NumElements;
+    Entry.Node = Length;
     Args.push_back(Entry);

     Entry.Ty = Type::getInt32Ty(*DAG.getContext());
-    Entry.Node = ElementSize;
+    Entry.Node = Alignment;
+    Args.push_back(Entry);
+
+    Entry.Ty = Type::getInt1Ty(*DAG.getContext());
+    Entry.Node = DestUnordered;
+    Args.push_back(Entry);
+
+    Entry.Ty = Type::getInt1Ty(*DAG.getContext());
+    Entry.Node = SrcUnordered;
     Args.push_back(Entry);

     uint64_t ElementSizeConstant =
-        cast<ConstantInt>(I.getArgOperand(3))->getZExtValue();
+        cast<ConstantInt>(I.getArgOperand(7))->getZExtValue();
     RTLIB::Libcall LibraryCall =
-        RTLIB::getMEMCPY_ELEMENT_ATOMIC(ElementSizeConstant);
+        RTLIB::getMEMCPY_ELEMENT_UNORDERED_ATOMIC(ElementSizeConstant);
     if (LibraryCall == RTLIB::UNKNOWN_LIBCALL)
       report_fatal_error("Unsupported element size");
Index: lib/CodeGen/TargetLoweringBase.cpp
===================================================================
--- lib/CodeGen/TargetLoweringBase.cpp
+++ lib/CodeGen/TargetLoweringBase.cpp
@@ -374,11 +374,11 @@
   Names[RTLIB::MEMCPY] = "memcpy";
   Names[RTLIB::MEMMOVE] = "memmove";
   Names[RTLIB::MEMSET] = "memset";
-  Names[RTLIB::MEMCPY_ELEMENT_ATOMIC_1] = "__llvm_memcpy_element_atomic_1";
-  Names[RTLIB::MEMCPY_ELEMENT_ATOMIC_2] = "__llvm_memcpy_element_atomic_2";
-  Names[RTLIB::MEMCPY_ELEMENT_ATOMIC_4] = "__llvm_memcpy_element_atomic_4";
-  Names[RTLIB::MEMCPY_ELEMENT_ATOMIC_8] = "__llvm_memcpy_element_atomic_8";
-  Names[RTLIB::MEMCPY_ELEMENT_ATOMIC_16] = "__llvm_memcpy_element_atomic_16";
+  Names[RTLIB::MEMCPY_ELEMENT_UNORDERED_ATOMIC_1] = "__llvm_memcpy_element_unordered_atomic_1";
+  Names[RTLIB::MEMCPY_ELEMENT_UNORDERED_ATOMIC_2] = "__llvm_memcpy_element_unordered_atomic_2";
+  Names[RTLIB::MEMCPY_ELEMENT_UNORDERED_ATOMIC_4] = "__llvm_memcpy_element_unordered_atomic_4";
+  Names[RTLIB::MEMCPY_ELEMENT_UNORDERED_ATOMIC_8] = "__llvm_memcpy_element_unordered_atomic_8";
+  Names[RTLIB::MEMCPY_ELEMENT_UNORDERED_ATOMIC_16] = "__llvm_memcpy_element_unordered_atomic_16";
   Names[RTLIB::UNWIND_RESUME] = "_Unwind_Resume";
   Names[RTLIB::SYNC_VAL_COMPARE_AND_SWAP_1] = "__sync_val_compare_and_swap_1";
   Names[RTLIB::SYNC_VAL_COMPARE_AND_SWAP_2] = "__sync_val_compare_and_swap_2";
@@ -781,22 +781,21 @@
   return UNKNOWN_LIBCALL;
 }

-RTLIB::Libcall RTLIB::getMEMCPY_ELEMENT_ATOMIC(uint64_t ElementSize) {
+RTLIB::Libcall RTLIB::getMEMCPY_ELEMENT_UNORDERED_ATOMIC(uint64_t ElementSize) {
   switch (ElementSize) {
   case 1:
-    return MEMCPY_ELEMENT_ATOMIC_1;
+    return MEMCPY_ELEMENT_UNORDERED_ATOMIC_1;
   case 2:
-    return MEMCPY_ELEMENT_ATOMIC_2;
+    return MEMCPY_ELEMENT_UNORDERED_ATOMIC_2;
   case 4:
-    return MEMCPY_ELEMENT_ATOMIC_4;
+    return MEMCPY_ELEMENT_UNORDERED_ATOMIC_4;
   case 8:
-    return MEMCPY_ELEMENT_ATOMIC_8;
+    return MEMCPY_ELEMENT_UNORDERED_ATOMIC_8;
   case 16:
-    return MEMCPY_ELEMENT_ATOMIC_16;
+    return MEMCPY_ELEMENT_UNORDERED_ATOMIC_16;
   default:
     return UNKNOWN_LIBCALL;
   }
-
 }

 /// InitCmpLibcallCCs - Set default comparison libcall CC.
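The patch does not spell out the prototype of the runtime routine; the sketch below infers it from
the argument list built in SelectionDAGBuilder above (dest, src, len, alignment, dest_unordered,
src_unordered, with the element size baked into the symbol name). The parameter types and the use
of ``std::memory_order_relaxed`` as a stand-in for LLVM's "unordered" ordering are assumptions;
a real runtime implementation may differ. ::

    // Illustrative runtime stub for element size 4: copy Len bytes one
    // 4-byte element per access, per the Semantics section above.
    #include <atomic>
    #include <cstdint>

    extern "C" void __llvm_memcpy_element_unordered_atomic_4(
        uint8_t *Dest, uint8_t *Src, uint32_t Len, uint32_t Align,
        bool DestUnordered, bool SrcUnordered) {
      // Len is guaranteed to be a positive multiple of the element size (4),
      // and both pointers are at least 4-byte aligned.
      auto *D = reinterpret_cast<std::atomic<uint32_t> *>(Dest);
      auto *S = reinterpret_cast<std::atomic<uint32_t> *>(Src);
      for (uint32_t I = 0; I != Len / 4; ++I) {
        uint32_t V = S[I].load(std::memory_order_relaxed);
        D[I].store(V, std::memory_order_relaxed);
      }
      (void)Align; (void)DestUnordered; (void)SrcUnordered;
    }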
Index: lib/IR/Verifier.cpp
===================================================================
--- lib/IR/Verifier.cpp
+++ lib/IR/Verifier.cpp
@@ -3987,10 +3987,40 @@
            CS);
     break;
   }
-  case Intrinsic::memcpy_element_atomic: {
-    ConstantInt *ElementSizeCI = dyn_cast<ConstantInt>(CS.getArgOperand(3));
-    Assert(ElementSizeCI, "element size of the element-wise atomic memory "
-                          "intrinsic must be a constant int",
+  case Intrinsic::memcpy_element_unordered_atomic: {
+    ConstantInt *AlignCI = dyn_cast<ConstantInt>(CS.getArgOperand(3));
+    Assert(AlignCI,
+           "alignment argument of element-wise unordered atomic memory "
+           "intrinsics must be a constant int",
+           CS);
+    const APInt &AlignVal = AlignCI->getValue();
+    Assert(AlignCI->isZero() || AlignVal.isPowerOf2(),
+           "alignment argument of element-wise unordered atomic memory "
+           "intrinsics must be a power of 2",
+           CS);
+
+    ConstantInt *DestUnorderedCI = dyn_cast<ConstantInt>(CS.getArgOperand(5));
+    Assert(DestUnorderedCI,
+           "dest_unordered of the element-wise unordered atomic memory "
+           "intrinsic must be a constant int",
+           CS);
+
+    ConstantInt *SrcUnorderedCI = dyn_cast<ConstantInt>(CS.getArgOperand(6));
+    Assert(SrcUnorderedCI,
+           "src_unordered of the element-wise unordered atomic memory "
+           "intrinsic must be a constant int",
+           CS);
+
+    // Cannot have both unordered flags being false.
+    Assert(!(DestUnorderedCI->isZero() && SrcUnorderedCI->isZero()),
+           "dest_unordered and src_unordered cannot both be zero on the "
+           "element-wise unordered atomic memory intrinsic",
+           CS);
+
+    ConstantInt *ElementSizeCI = dyn_cast<ConstantInt>(CS.getArgOperand(7));
+    Assert(ElementSizeCI,
+           "element size of the element-wise unordered atomic memory "
+           "intrinsic must be a constant int",
            CS);
     const APInt &ElementSizeVal = ElementSizeCI->getValue();
     Assert(ElementSizeVal.isPowerOf2(),
@@ -4001,16 +4031,12 @@
     auto IsValidAlignment = [&](uint64_t Alignment) {
       return isPowerOf2_64(Alignment) && ElementSizeVal.ule(Alignment);
     };
-
     uint64_t DstAlignment = CS.getParamAlignment(0),
              SrcAlignment = CS.getParamAlignment(1);

     Assert(IsValidAlignment(DstAlignment),
-           "incorrect alignment of the destination argument",
-           CS);
+           "incorrect alignment of the destination argument", CS);
     Assert(IsValidAlignment(SrcAlignment),
-           "incorrect alignment of the source argument",
-           CS);
+           "incorrect alignment of the source argument", CS);
     break;
   }
   case Intrinsic::gcroot:
Index: lib/Transforms/InstCombine/InstCombineCalls.cpp
===================================================================
--- lib/Transforms/InstCombine/InstCombineCalls.cpp
+++ lib/Transforms/InstCombine/InstCombineCalls.cpp
@@ -94,6 +94,7 @@
   return ConstantVector::get(BoolVec);
 }

+/* -- temp removal to aid staging
 Instruction *
 InstCombiner::SimplifyElementAtomicMemCpy(ElementAtomicMemCpyInst *AMI) {
   // Try to unfold this intrinsic into sequence of explicit atomic loads and
@@ -165,6 +166,7 @@
   AMI->setNumElements(Constant::getNullValue(NumElementsCI->getType()));
   return AMI;
 }
+*/

 Instruction *InstCombiner::SimplifyMemTransfer(MemIntrinsic *MI) {
   unsigned DstAlign = getKnownAlignment(MI->getArgOperand(0), DL, MI, &AC, &DT);
@@ -1892,6 +1894,7 @@
     if (Changed) return II;
   }

+  /* -- temp removal to simplify staging
   if (auto *AMI = dyn_cast<ElementAtomicMemCpyInst>(II)) {
     if (Constant *C = dyn_cast<Constant>(AMI->getNumElements()))
       if (C->isNullValue())
@@ -1900,7 +1903,8 @@
     if (Instruction *I = SimplifyElementAtomicMemCpy(AMI))
       return I;
   }
-
+  */
+
   if (Instruction *I = SimplifyNVVMIntrinsic(II, *this))
     return I;
Index: lib/Transforms/InstCombine/InstCombineInternal.h
===================================================================
--- lib/Transforms/InstCombine/InstCombineInternal.h
+++ lib/Transforms/InstCombine/InstCombineInternal.h
@@ -687,7 +687,7 @@
   Instruction *MatchBSwap(BinaryOperator &I);
   bool SimplifyStoreAtEndOfBlock(StoreInst &SI);

-  Instruction *SimplifyElementAtomicMemCpy(ElementAtomicMemCpyInst *AMI);
+  // Instruction *SimplifyElementAtomicMemCpy(ElementAtomicMemCpyInst *AMI); -- temp removal to aid staging
   Instruction *SimplifyMemTransfer(MemIntrinsic *MI);
   Instruction *SimplifyMemSet(MemSetInst *MI);

Index: test/CodeGen/X86/element-wise-atomic-memory-intrinsics.ll
===================================================================
--- test/CodeGen/X86/element-wise-atomic-memory-intrinsics.ll
+++ test/CodeGen/X86/element-wise-atomic-memory-intrinsics.ll
@@ -2,47 +2,77 @@
 define i8* @test_memcpy1(i8* %P, i8* %Q) {
   ; CHECK: test_memcpy
-  call void @llvm.memcpy.element.atomic.p0i8.p0i8(i8* align 4 %P, i8* align 4 %Q, i64 1, i32 1)
+  call void @llvm.memcpy.element.unordered.atomic.p0i8.p0i8.i32(i8* align 4 %P, i8* align 4 %Q, i32 1, i32 4, i1 0, i1 1, i1 1, i8 1)
   ret i8* %P
+  ; 3rd arg (%edx) -- size
   ; CHECK-DAG: movl $1, %edx
-  ; CHECK-DAG: movl $1, %ecx
-  ; CHECK: __llvm_memcpy_element_atomic_1
+  ; 4th arg (%ecx) -- align
+  ; CHECK-DAG: movl $4, %ecx
+  ; 5th arg (%r8) -- dest_unordered
+  ; CHECK-DAG: movl $1, %r8d
+  ; 6th arg (%r9) -- src_unordered
+  ; CHECK-DAG: movl $1, %r9d
+  ; CHECK: __llvm_memcpy_element_unordered_atomic_1
 }

 define i8* @test_memcpy2(i8* %P, i8* %Q) {
   ; CHECK: test_memcpy2
-  call void @llvm.memcpy.element.atomic.p0i8.p0i8(i8* align 4 %P, i8* align 4 %Q, i64 2, i32 2)
+  call void @llvm.memcpy.element.unordered.atomic.p0i8.p0i8.i32(i8* align 4 %P, i8* align 4 %Q, i32 2, i32 4, i1 0, i1 1, i1 1, i8 2)
   ret i8* %P
+  ; 3rd arg (%edx) -- size
   ; CHECK-DAG: movl $2, %edx
-  ; CHECK-DAG: movl $2, %ecx
-  ; CHECK: __llvm_memcpy_element_atomic_2
+  ; 4th arg (%ecx) -- align
+  ; CHECK-DAG: movl $4, %ecx
+  ; 5th arg (%r8) -- dest_unordered
+  ; CHECK-DAG: movl $1, %r8d
+  ; 6th arg (%r9) -- src_unordered
+  ; CHECK-DAG: movl $1, %r9d
+  ; CHECK: __llvm_memcpy_element_unordered_atomic_2
 }

 define i8* @test_memcpy4(i8* %P, i8* %Q) {
   ; CHECK: test_memcpy4
-  call void @llvm.memcpy.element.atomic.p0i8.p0i8(i8* align 4 %P, i8* align 4 %Q, i64 4, i32 4)
+  call void @llvm.memcpy.element.unordered.atomic.p0i8.p0i8.i32(i8* align 4 %P, i8* align 4 %Q, i32 4, i32 4, i1 0, i1 1, i1 1, i8 4)
   ret i8* %P
+  ; 3rd arg (%edx) -- size
   ; CHECK-DAG: movl $4, %edx
+  ; 4th arg (%ecx) -- align
   ; CHECK-DAG: movl $4, %ecx
-  ; CHECK: __llvm_memcpy_element_atomic_4
+  ; 5th arg (%r8) -- dest_unordered
+  ; CHECK-DAG: movl $1, %r8d
+  ; 6th arg (%r9) -- src_unordered
+  ; CHECK-DAG: movl $1, %r9d
+  ; CHECK: __llvm_memcpy_element_unordered_atomic_4
 }

 define i8* @test_memcpy8(i8* %P, i8* %Q) {
   ; CHECK: test_memcpy8
-  call void @llvm.memcpy.element.atomic.p0i8.p0i8(i8* align 8 %P, i8* align 8 %Q, i64 8, i32 8)
+  call void @llvm.memcpy.element.unordered.atomic.p0i8.p0i8.i32(i8* align 8 %P, i8* align 8 %Q, i32 8, i32 8, i1 0, i1 1, i1 1, i8 8)
   ret i8* %P
+  ; 3rd arg (%edx) -- size
   ; CHECK-DAG: movl $8, %edx
+  ; 4th arg (%ecx) -- align
   ; CHECK-DAG: movl $8, %ecx
-  ; CHECK: __llvm_memcpy_element_atomic_8
+  ; 5th arg (%r8) -- dest_unordered
+  ; CHECK-DAG: movl $1, %r8d
+  ; 6th arg (%r9) -- src_unordered
+  ; CHECK-DAG: movl $1, %r9d
+  ; CHECK: __llvm_memcpy_element_unordered_atomic_8
 }

 define i8* @test_memcpy16(i8* %P, i8* %Q) {
   ; CHECK: test_memcpy16
-  call void @llvm.memcpy.element.atomic.p0i8.p0i8(i8* align 16 %P, i8* align 16 %Q, i64 16, i32 16)
+  call void @llvm.memcpy.element.unordered.atomic.p0i8.p0i8.i32(i8* align 16 %P, i8* align 16 %Q, i32 16, i32 16, i1 0, i1 1, i1 1, i8 16)
   ret i8* %P
+  ; 3rd arg (%edx) -- size
   ; CHECK-DAG: movl $16, %edx
+  ; 4th arg (%ecx) -- align
   ; CHECK-DAG: movl $16, %ecx
-  ; CHECK: __llvm_memcpy_element_atomic_16
+  ; 5th arg (%r8) -- dest_unordered
+  ; CHECK-DAG: movl $1, %r8d
+  ; 6th arg (%r9) -- src_unordered
+  ; CHECK-DAG: movl $1, %r9d
+  ; CHECK: __llvm_memcpy_element_unordered_atomic_16
 }

 define void @test_memcpy_args(i8** %Storage) {
@@ -51,18 +81,21 @@
   %Src.addr = getelementptr i8*, i8** %Storage, i64 1
   %Src = load i8*, i8** %Src.addr

-  ; First argument
+  ; 1st arg (%rdi)
   ; CHECK-DAG: movq (%rdi), [[REG1:%r.+]]
   ; CHECK-DAG: movq [[REG1]], %rdi
-  ; Second argument
+  ; 2nd arg (%rsi)
   ; CHECK-DAG: movq 8(%rdi), %rsi
-  ; Third argument
+  ; 3rd arg (%edx) -- size
   ; CHECK-DAG: movl $4, %edx
-  ; Fourth argument
+  ; 4th arg (%ecx) -- align
   ; CHECK-DAG: movl $4, %ecx
-  ; CHECK: __llvm_memcpy_element_atomic_4
-  call void @llvm.memcpy.element.atomic.p0i8.p0i8(i8* align 4 %Dst, i8* align 4 %Src, i64 4, i32 4)
-  ret void
+  ; 5th arg (%r8) -- dest_unordered
+  ; CHECK-DAG: movl $1, %r8d
+  ; 6th arg (%r9) -- src_unordered
+  ; CHECK-DAG: movl $1, %r9d
+  ; CHECK: __llvm_memcpy_element_unordered_atomic_4
+  call void @llvm.memcpy.element.unordered.atomic.p0i8.p0i8.i32(i8* align 4 %Dst, i8* align 4 %Src, i32 4, i32 4, i1 0, i1 1, i1 1, i8 4) ret void
 }

-declare void @llvm.memcpy.element.atomic.p0i8.p0i8(i8* nocapture, i8* nocapture, i64, i32) nounwind
+declare void @llvm.memcpy.element.unordered.atomic.p0i8.p0i8.i32(i8* nocapture, i8* nocapture, i32, i32, i1, i1, i1, i8) nounwind
Index: test/Transforms/InstCombine/element-atomic-memcpy-to-loads.ll
===================================================================
--- test/Transforms/InstCombine/element-atomic-memcpy-to-loads.ll
+++ test/Transforms/InstCombine/element-atomic-memcpy-to-loads.ll
@@ -1,4 +1,6 @@
 ; RUN: opt -instcombine -unfold-element-atomic-memcpy-max-elements=8 -S < %s | FileCheck %s
+; Temporarily an expected failure until inst combine is updated in the next patch
+; XFAIL: *
 target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"

 ; Test basic unfolding
Index: test/Verifier/element-wise-atomic-memory-intrinsics.ll
===================================================================
--- test/Verifier/element-wise-atomic-memory-intrinsics.ll
+++ test/Verifier/element-wise-atomic-memory-intrinsics.ll
@@ -1,17 +1,38 @@
 ; RUN: not opt -verify < %s 2>&1 | FileCheck %s

-define void @test_memcpy(i8* %P, i8* %Q) {
+define void @test_memcpy(i8* %P, i8* %Q, i32 %A, i8 %E, i1 %V) {
+  ; CHECK: alignment argument of element-wise unordered atomic memory intrinsics must be a constant int
+  call void @llvm.memcpy.element.unordered.atomic.p0i8.p0i8.i32(i8* align 4 %P, i8* align 4 %Q, i32 1, i32 %A, i1 0, i1 1, i1 1, i8 1)
+
+  ; CHECK: alignment argument of element-wise unordered atomic memory intrinsics must be a power of 2
+  call void @llvm.memcpy.element.unordered.atomic.p0i8.p0i8.i32(i8* align 4 %P, i8* align 4 %Q, i32 1, i32 5, i1 0, i1 1, i1 1, i8 1)
+
+  ; CHECK: dest_unordered of the element-wise unordered atomic memory intrinsic must be a constant int
+  call void @llvm.memcpy.element.unordered.atomic.p0i8.p0i8.i32(i8* align 4 %P, i8* align 4 %Q, i32 1, i32 4, i1 0, i1 %V, i1 1, i8 1)
+
+  ; CHECK: src_unordered of the element-wise unordered atomic memory intrinsic must be a constant int
+  call void @llvm.memcpy.element.unordered.atomic.p0i8.p0i8.i32(i8* align 4 %P, i8* align 4 %Q, i32 1, i32 4, i1 0, i1 1, i1 %V, i8 1)
+
+  ; CHECK: dest_unordered and src_unordered cannot both be zero on the element-wise unordered atomic memory intrinsic
+  call void @llvm.memcpy.element.unordered.atomic.p0i8.p0i8.i32(i8* align 4 %P, i8* align 4 %Q, i32 1, i32 4, i1 0, i1 0, i1 0, i8 1)
+
+  ; CHECK: element size of the element-wise unordered atomic memory intrinsic must be a constant int
+  call void @llvm.memcpy.element.unordered.atomic.p0i8.p0i8.i32(i8* align 4 %P, i8* align 4 %Q, i32 1, i32 4, i1 0, i1 1, i1 1, i8 %E)

   ; CHECK: element size of the element-wise atomic memory intrinsic must be a power of 2
-  call void @llvm.memcpy.element.atomic.p0i8.p0i8(i8* align 2 %P, i8* align 2 %Q, i64 4, i32 3)
+  call void @llvm.memcpy.element.unordered.atomic.p0i8.p0i8.i32(i8* align 4 %P, i8* align 4 %Q, i32 1, i32 4, i1 0, i1 1, i1 1, i8 3)

   ; CHECK: incorrect alignment of the destination argument
-  call void @llvm.memcpy.element.atomic.p0i8.p0i8(i8* align 2 %P, i8* align 4 %Q, i64 4, i32 4)
+  call void @llvm.memcpy.element.unordered.atomic.p0i8.p0i8.i32(i8* %P, i8* align 4 %Q, i32 1, i32 4, i1 0, i1 1, i1 1, i8 1)
+  ; CHECK: incorrect alignment of the destination argument
+  call void @llvm.memcpy.element.unordered.atomic.p0i8.p0i8.i32(i8* align 1 %P, i8* align 4 %Q, i32 4, i32 4, i1 0, i1 1, i1 1, i8 4)

   ; CHECK: incorrect alignment of the source argument
-  call void @llvm.memcpy.element.atomic.p0i8.p0i8(i8* align 4 %P, i8* align 2 %Q, i64 4, i32 4)
+  call void @llvm.memcpy.element.unordered.atomic.p0i8.p0i8.i32(i8* align 4 %P, i8* %Q, i32 1, i32 4, i1 0, i1 1, i1 1, i8 1)
+  ; CHECK: incorrect alignment of the source argument
+  call void @llvm.memcpy.element.unordered.atomic.p0i8.p0i8.i32(i8* align 4 %P, i8* align 1 %Q, i32 4, i32 4, i1 0, i1 1, i1 1, i8 4)
+
   ret void
 }

-declare void @llvm.memcpy.element.atomic.p0i8.p0i8(i8* nocapture, i8* nocapture, i64, i32) nounwind
-
+declare void @llvm.memcpy.element.unordered.atomic.p0i8.p0i8.i32(i8* nocapture, i8* nocapture, i32, i32, i1, i1, i1, i8) nounwind
 ; CHECK: input module is broken!