Index: docs/LangRef.rst =================================================================== --- docs/LangRef.rst +++ docs/LangRef.rst @@ -13553,62 +13553,66 @@ These intrinsics are similar to the standard library memory intrinsics except that they perform memory transfer as a sequence of atomic memory accesses. -.. _int_memcpy_element_atomic: +.. _int_memcpy_element_unordered_atomic: -'``llvm.memcpy.element.atomic``' Intrinsic -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +'``llvm.memcpy.element.unordered.atomic``' Intrinsic +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Syntax: """"""" -This is an overloaded intrinsic. You can use ``llvm.memcpy.element.atomic`` on +This is an overloaded intrinsic. You can use ``llvm.memcpy.element.unordered.atomic`` on any integer bit width and for different address spaces. Not all targets support all bit widths however. :: - declare void @llvm.memcpy.element.atomic.p0i8.p0i8(i8* <dest>, i8* <src>, - i64 <num_elements>, i32 <element_size>) + declare void @llvm.memcpy.element.unordered.atomic.p0i8.p0i8.i32(i8* <dest>, i8* <src>, i32 <len>, + i32 <align>, i1 <isvolatile>, + i8 <isunordered>, i8 <element_size>) + declare void @llvm.memcpy.element.unordered.atomic.p0i8.p0i8.i64(i8* <dest>, i8* <src>, i64 <len>, + i32 <align>, i1 <isvolatile>, + i8 <isunordered>, i8 <element_size>) Overview: """"""""" -The '``llvm.memcpy.element.atomic.*``' intrinsic performs copy of a block of -memory from the source location to the destination location as a sequence of -unordered atomic memory accesses where each access is a multiple of -``element_size`` bytes wide and aligned at an element size boundary. For example -each element is accessed atomically in source and destination buffers. +The '``llvm.memcpy.element.unordered.atomic.*``' intrinsic is a specialization of the '``llvm.memcpy.*``' +intrinsic. It differs in that the ``dest`` and ``src`` are treated as arrays with elements that are +exactly ``element_size`` bytes, and the copy between buffers is done in a way that uses +:ref:`unordered atomic <ordering>` load/store operations that are a positive integer multiple +of the ``element_size`` in size. Arguments: """""""""" -The first argument is a pointer to the destination, the second is a -pointer to the source. The third argument is an integer argument -specifying the number of elements to copy, the fourth argument is size of -the single element in bytes. +The first five arguments are the same as they are in the :ref:`@llvm.memcpy <int_memcpy>` intrinsic, +with the added constraint that ``len`` must be a positive integer multiple of the ``element_size``. -``element_size`` should be a power of two, greater than zero and less than -a target-specific atomic access size limit. +``isunordered`` must be a compile-time constant whose value is exactly 1, 2, or 3. -For each of the input pointers ``align`` parameter attribute must be specified. -It must be a power of two and greater than or equal to the ``element_size``. -Caller guarantees that both the source and destination pointers are aligned to -that boundary. +``element_size`` must be a compile-time constant positive power of two no greater than the target-specific atomic +access size limit. Semantics: """""""""" -The '``llvm.memcpy.element.atomic.*``' intrinsic copies -'``num_elements`` * ``element_size``' bytes of memory from the source location to -the destination location. These locations are not allowed to overlap. Memory copy -is performed as a sequence of unordered atomic memory accesses where each access -is guaranteed to be a multiple of ``element_size`` bytes wide and aligned at an -element size boundary.
+The '``llvm.memcpy.element.unordered.atomic.*``' intrinsic copies ``len`` bytes of memory from +the source location to the destination location. These locations are not allowed to overlap. +The memory copy is performed as a sequence of load/store operations where each access is +guaranteed to be a multiple of ``element_size`` bytes wide and aligned at an ``element_size`` +boundary. Furthermore, the load/store operations used will be unordered atomic operations as +dictated by ``isunordered`` as follows: + +* ``isunordered`` == 1 : Stores to the dest are unordered atomic. +* ``isunordered`` == 2 : Loads from the src are unordered atomic. +* ``isunordered`` == 3 : Stores to the dest and loads from the src are all unordered atomic. The order of the copy is unspecified. The same value may be read from the source buffer many times, but only one write is issued to the destination buffer per element. It is well defined to have concurrent reads and writes to both source -and destination provided those reads and writes are at least unordered atomic. +and destination provided those reads and writes are at least unordered atomic, and +``isunordered`` makes the intrinsic's own accesses to those buffers atomic. This intrinsic does not provide any additional ordering guarantees over those provided by a set of unordered loads from the source location and stores to the @@ -13617,8 +13621,8 @@ Lowering: """"""""" -In the most general case call to the '``llvm.memcpy.element.atomic.*``' is lowered -to a call to the symbol ``__llvm_memcpy_element_atomic_*``. Where '*' is replaced +In the most general case, a call to the '``llvm.memcpy.element.unordered.atomic.*``' is lowered +to a call to the symbol ``__llvm_memcpy_element_unordered_atomic_*``, where '*' is replaced with an actual element size. -Optimizer is allowed to inline memory copy when it's profitable to do so. +The optimizer is allowed to inline the memory copy when it's profitable to do so. Index: include/llvm/CodeGen/RuntimeLibcalls.h =================================================================== --- include/llvm/CodeGen/RuntimeLibcalls.h +++ include/llvm/CodeGen/RuntimeLibcalls.h @@ -333,12 +333,12 @@ MEMSET, MEMMOVE, - // ELEMENT-WISE ATOMIC MEMORY - MEMCPY_ELEMENT_ATOMIC_1, - MEMCPY_ELEMENT_ATOMIC_2, - MEMCPY_ELEMENT_ATOMIC_4, - MEMCPY_ELEMENT_ATOMIC_8, - MEMCPY_ELEMENT_ATOMIC_16, + // ELEMENT-WISE UNORDERED-ATOMIC MEMORY of different element sizes + MEMCPY_ELEMENT_UNORDERED_ATOMIC_1, + MEMCPY_ELEMENT_UNORDERED_ATOMIC_2, + MEMCPY_ELEMENT_UNORDERED_ATOMIC_4, + MEMCPY_ELEMENT_UNORDERED_ATOMIC_8, + MEMCPY_ELEMENT_UNORDERED_ATOMIC_16, // EXCEPTION HANDLING UNWIND_RESUME, @@ -511,9 +511,9 @@ /// UNKNOWN_LIBCALL if there is none. Libcall getSYNC(unsigned Opc, MVT VT); - /// getMEMCPY_ELEMENT_ATOMIC - Return MEMCPY_ELEMENT_ATOMIC_* value for the + /// getMEMCPY_ELEMENT_UNORDERED_ATOMIC - Return MEMCPY_ELEMENT_UNORDERED_ATOMIC_* value for the /// given element size or UNKNOW_LIBCALL if there is none. - Libcall getMEMCPY_ELEMENT_ATOMIC(uint64_t ElementSize); + Libcall getMEMCPY_ELEMENT_UNORDERED_ATOMIC(uint64_t ElementSize); } } Index: include/llvm/IR/IntrinsicInst.h =================================================================== --- include/llvm/IR/IntrinsicInst.h +++ include/llvm/IR/IntrinsicInst.h @@ -192,25 +192,113 @@ }; /// This class represents atomic memcpy intrinsic - /// TODO: Integrate this class into MemIntrinsic hierarchy.
- class ElementAtomicMemCpyInst : public IntrinsicInst { + /// TODO: Integrate this class into MemIntrinsic hierarchy; for now this is + /// C&P of all methods from that hierarchy + class ElementUnorderedAtomicMemCpyInst : public IntrinsicInst { + private: + constexpr static int ARG_DEST = 0; + constexpr static int ARG_SRC = 1; + constexpr static int ARG_LENGTH = 2; + constexpr static int ARG_ALIGN = 3; + constexpr static int ARG_VOLATILE = 4; + constexpr static int ARG_ISUNORDERED = 5; + constexpr static int ARG_ELEMENTSIZE = 6; public: - Value *getRawDest() const { return getArgOperand(0); } - Value *getRawSource() const { return getArgOperand(1); } + Value *getRawDest() const { + return const_cast<Value *>(getArgOperand(ARG_DEST)); + } + const Use &getRawDestUse() const { return getArgOperandUse(ARG_DEST); } + Use &getRawDestUse() { return getArgOperandUse(ARG_DEST); } + + /// Return the arguments to the instruction. + Value *getRawSource() const { + return const_cast<Value *>(getArgOperand(ARG_SRC)); + } + const Use &getRawSourceUse() const { return getArgOperandUse(ARG_SRC); } + Use &getRawSourceUse() { return getArgOperandUse(ARG_SRC); } - Value *getNumElements() const { return getArgOperand(2); } - void setNumElements(Value *V) { setArgOperand(2, V); } + Value *getLength() const { + return const_cast<Value *>(getArgOperand(ARG_LENGTH)); + } + const Use &getLengthUse() const { return getArgOperandUse(ARG_LENGTH); } + Use &getLengthUse() { return getArgOperandUse(ARG_LENGTH); } + + ConstantInt *getAlignmentCst() const { + return cast<ConstantInt>(const_cast<Value *>(getArgOperand(ARG_ALIGN))); + } - uint64_t getSrcAlignment() const { return getParamAlignment(0); } - uint64_t getDstAlignment() const { return getParamAlignment(1); } + unsigned getAlignment() const { return getAlignmentCst()->getZExtValue(); } + + Type *getAlignmentType() const { + return getArgOperand(ARG_ALIGN)->getType(); + } + + ConstantInt *getVolatileCst() const { + return cast<ConstantInt>( + const_cast<Value *>(getArgOperand(ARG_VOLATILE))); + } - uint64_t getElementSizeInBytes() const { - Value *Arg = getArgOperand(3); + bool isVolatile() const { return !getVolatileCst()->isZero(); } + + uint8_t getIsUnordered() const { + Value *Arg = getArgOperand(ARG_ISUNORDERED); + return uint8_t(cast<ConstantInt>(Arg)->getZExtValue()); + } + + uint8_t getElementSizeInBytes() const { + Value *Arg = getArgOperand(ARG_ELEMENTSIZE); return cast<ConstantInt>(Arg)->getZExtValue(); } + /// This is just like getRawDest, but it strips off any cast + /// instructions that feed it, giving the original input. The returned + /// value is guaranteed to be a pointer. + Value *getDest() const { return getRawDest()->stripPointerCasts(); } + + /// This is just like getRawSource, but it strips off any cast + /// instructions that feed it, giving the original input. The returned + /// value is guaranteed to be a pointer. + Value *getSource() const { return getRawSource()->stripPointerCasts(); } + + unsigned getDestAddressSpace() const { + return cast<PointerType>(getRawDest()->getType())->getAddressSpace(); + } + + unsigned getSourceAddressSpace() const { + return cast<PointerType>(getRawSource()->getType())->getAddressSpace(); + } + + /// Set the specified arguments of the instruction.
+ void setDest(Value *Ptr) { + assert(getRawDest()->getType() == Ptr->getType() && + "setDest called with pointer of wrong type!"); + setArgOperand(ARG_DEST, Ptr); + } + + void setSource(Value *Ptr) { + assert(getRawSource()->getType() == Ptr->getType() && + "setSource called with pointer of wrong type!"); + setArgOperand(ARG_SRC, Ptr); + } + + void setLength(Value *L) { + assert(getLength()->getType() == L->getType() && + "setLength called with value of wrong type!"); + setArgOperand(ARG_LENGTH, L); + } + + void setAlignment(Constant *A) { setArgOperand(ARG_ALIGN, A); } + + void setVolatile(Constant *V) { setArgOperand(ARG_VOLATILE, V); } + + void setIsUnordered(Constant *V) { setArgOperand(ARG_ISUNORDERED, V); } + + void setElementSizeInBytes(Constant *V) { + setArgOperand(ARG_ELEMENTSIZE, V); + } + static inline bool classof(const IntrinsicInst *I) { - return I->getIntrinsicID() == Intrinsic::memcpy_element_atomic; + return I->getIntrinsicID() == Intrinsic::memcpy_element_unordered_atomic; } static inline bool classof(const Value *V) { return isa<IntrinsicInst>(V) && classof(cast<IntrinsicInst>(V)); Index: include/llvm/IR/Intrinsics.td =================================================================== --- include/llvm/IR/Intrinsics.td +++ include/llvm/IR/Intrinsics.td @@ -806,11 +806,18 @@ //===------ Memory intrinsics with element-wise atomicity guarantees ------===// // -def int_memcpy_element_atomic : Intrinsic<[], - [llvm_anyptr_ty, llvm_anyptr_ty, - llvm_i64_ty, llvm_i32_ty], - [IntrArgMemOnly, NoCapture<0>, NoCapture<1>, - WriteOnly<0>, ReadOnly<1>]>; +// llvm.memcpy.element.unordered.atomic(dest, src, length, alignment, volatile, +// isunordered, elementsize) +def int_memcpy_element_unordered_atomic + : Intrinsic<[], + [ + llvm_anyptr_ty, llvm_anyptr_ty, llvm_anyint_ty, llvm_i32_ty, + llvm_i1_ty, llvm_i8_ty, llvm_i8_ty + ], + [ + IntrArgMemOnly, NoCapture<0>, NoCapture<1>, WriteOnly<0>, + ReadOnly<1> + ]>; //===------------------------ Reduction Intrinsics ------------------------===// // Index: lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp =================================================================== --- lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp +++ lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp @@ -4867,11 +4867,14 @@ updateDAGForMaybeTailCall(MM); return nullptr; } - case Intrinsic::memcpy_element_atomic: { + case Intrinsic::memcpy_element_unordered_atomic: { SDValue Dst = getValue(I.getArgOperand(0)); SDValue Src = getValue(I.getArgOperand(1)); - SDValue NumElements = getValue(I.getArgOperand(2)); - SDValue ElementSize = getValue(I.getArgOperand(3)); + SDValue Length = getValue(I.getArgOperand(2)); + SDValue Alignment = getValue(I.getArgOperand(3)); + // Note: arg4 is isvolatile, which is unused for this intrinsic + SDValue IsUnordered = getValue(I.getArgOperand(5)); + // SDValue ElementSize = getValue(I.getArgOperand(6)); // Emit a library call.
TargetLowering::ArgListTy Args; @@ -4884,17 +4887,21 @@ Args.push_back(Entry); Entry.Ty = I.getArgOperand(2)->getType(); - Entry.Node = NumElements; + Entry.Node = Length; Args.push_back(Entry); Entry.Ty = Type::getInt32Ty(*DAG.getContext()); - Entry.Node = ElementSize; + Entry.Node = Alignment; + Args.push_back(Entry); + + Entry.Ty = Type::getIntNTy(*DAG.getContext(), 2); + Entry.Node = IsUnordered; Args.push_back(Entry); uint64_t ElementSizeConstant = - cast<ConstantInt>(I.getArgOperand(3))->getZExtValue(); + cast<ConstantInt>(I.getArgOperand(6))->getZExtValue(); RTLIB::Libcall LibraryCall = - RTLIB::getMEMCPY_ELEMENT_ATOMIC(ElementSizeConstant); + RTLIB::getMEMCPY_ELEMENT_UNORDERED_ATOMIC(ElementSizeConstant); if (LibraryCall == RTLIB::UNKNOWN_LIBCALL) report_fatal_error("Unsupported element size"); Index: lib/CodeGen/TargetLoweringBase.cpp =================================================================== --- lib/CodeGen/TargetLoweringBase.cpp +++ lib/CodeGen/TargetLoweringBase.cpp @@ -374,11 +374,11 @@ Names[RTLIB::MEMCPY] = "memcpy"; Names[RTLIB::MEMMOVE] = "memmove"; Names[RTLIB::MEMSET] = "memset"; - Names[RTLIB::MEMCPY_ELEMENT_ATOMIC_1] = "__llvm_memcpy_element_atomic_1"; - Names[RTLIB::MEMCPY_ELEMENT_ATOMIC_2] = "__llvm_memcpy_element_atomic_2"; - Names[RTLIB::MEMCPY_ELEMENT_ATOMIC_4] = "__llvm_memcpy_element_atomic_4"; - Names[RTLIB::MEMCPY_ELEMENT_ATOMIC_8] = "__llvm_memcpy_element_atomic_8"; - Names[RTLIB::MEMCPY_ELEMENT_ATOMIC_16] = "__llvm_memcpy_element_atomic_16"; + Names[RTLIB::MEMCPY_ELEMENT_UNORDERED_ATOMIC_1] = "__llvm_memcpy_element_unordered_atomic_1"; + Names[RTLIB::MEMCPY_ELEMENT_UNORDERED_ATOMIC_2] = "__llvm_memcpy_element_unordered_atomic_2"; + Names[RTLIB::MEMCPY_ELEMENT_UNORDERED_ATOMIC_4] = "__llvm_memcpy_element_unordered_atomic_4"; + Names[RTLIB::MEMCPY_ELEMENT_UNORDERED_ATOMIC_8] = "__llvm_memcpy_element_unordered_atomic_8"; + Names[RTLIB::MEMCPY_ELEMENT_UNORDERED_ATOMIC_16] = "__llvm_memcpy_element_unordered_atomic_16"; Names[RTLIB::UNWIND_RESUME] = "_Unwind_Resume"; Names[RTLIB::SYNC_VAL_COMPARE_AND_SWAP_1] = "__sync_val_compare_and_swap_1"; Names[RTLIB::SYNC_VAL_COMPARE_AND_SWAP_2] = "__sync_val_compare_and_swap_2"; @@ -781,22 +781,21 @@ return UNKNOWN_LIBCALL; } -RTLIB::Libcall RTLIB::getMEMCPY_ELEMENT_ATOMIC(uint64_t ElementSize) { +RTLIB::Libcall RTLIB::getMEMCPY_ELEMENT_UNORDERED_ATOMIC(uint64_t ElementSize) { switch (ElementSize) { case 1: - return MEMCPY_ELEMENT_ATOMIC_1; + return MEMCPY_ELEMENT_UNORDERED_ATOMIC_1; case 2: - return MEMCPY_ELEMENT_ATOMIC_2; + return MEMCPY_ELEMENT_UNORDERED_ATOMIC_2; case 4: - return MEMCPY_ELEMENT_ATOMIC_4; + return MEMCPY_ELEMENT_UNORDERED_ATOMIC_4; case 8: - return MEMCPY_ELEMENT_ATOMIC_8; + return MEMCPY_ELEMENT_UNORDERED_ATOMIC_8; case 16: - return MEMCPY_ELEMENT_ATOMIC_16; + return MEMCPY_ELEMENT_UNORDERED_ATOMIC_16; default: return UNKNOWN_LIBCALL; } - } /// InitCmpLibcallCCs - Set default comparison libcall CC.
Index: lib/IR/Verifier.cpp =================================================================== --- lib/IR/Verifier.cpp +++ lib/IR/Verifier.cpp @@ -3987,29 +3987,39 @@ CS); break; } - case Intrinsic::memcpy_element_atomic: { - ConstantInt *ElementSizeCI = dyn_cast<ConstantInt>(CS.getArgOperand(3)); - Assert(ElementSizeCI, "element size of the element-wise atomic memory " - "intrinsic must be a constant int", + case Intrinsic::memcpy_element_unordered_atomic: { + ConstantInt *AlignCI = dyn_cast<ConstantInt>(CS.getArgOperand(3)); + Assert(AlignCI, + "alignment argument of element-wise unordered atomic memory " + "intrinsics must be a constant int", CS); - const APInt &ElementSizeVal = ElementSizeCI->getValue(); - Assert(ElementSizeVal.isPowerOf2(), - "element size of the element-wise atomic memory intrinsic " - "must be a power of 2", + const APInt &AlignVal = AlignCI->getValue(); + Assert(AlignCI->isZero() || AlignVal.isPowerOf2(), + "alignment argument of element-wise unordered atomic memory " + "intrinsics must be a power of 2", CS); - auto IsValidAlignment = [&](uint64_t Alignment) { - return isPowerOf2_64(Alignment) && ElementSizeVal.ule(Alignment); - }; + ConstantInt *IsUnorderedCI = dyn_cast<ConstantInt>(CS.getArgOperand(5)); + Assert(IsUnorderedCI, + "isunordered of the element-wise unordered atomic memory " + "intrinsic must be a constant int", + CS); - uint64_t DstAlignment = CS.getParamAlignment(0), - SrcAlignment = CS.getParamAlignment(1); + const APInt &IsUnorderedVal = IsUnorderedCI->getValue(); + Assert(!IsUnorderedCI->isZero() && IsUnorderedVal.ult(4), + "isunordered of the element-wise unordered atomic memory intrinsic " + "must be in the range [1,3]", + CS); - Assert(IsValidAlignment(DstAlignment), - "incorrect alignment of the destination argument", + ConstantInt *ElementSizeCI = dyn_cast<ConstantInt>(CS.getArgOperand(6)); + Assert(ElementSizeCI, + "element size of the element-wise unordered atomic memory " + "intrinsic must be a constant int", CS); - Assert(IsValidAlignment(SrcAlignment), - "incorrect alignment of the source argument", + const APInt &ElementSizeVal = ElementSizeCI->getValue(); + Assert(ElementSizeVal.isPowerOf2(), + "element size of the element-wise atomic memory intrinsic " + "must be a power of 2", CS); break; } Index: lib/Transforms/InstCombine/InstCombineCalls.cpp =================================================================== --- lib/Transforms/InstCombine/InstCombineCalls.cpp +++ lib/Transforms/InstCombine/InstCombineCalls.cpp @@ -94,6 +94,7 @@ return ConstantVector::get(BoolVec); } +/* -- temp removal to aid staging Instruction * InstCombiner::SimplifyElementAtomicMemCpy(ElementAtomicMemCpyInst *AMI) { // Try to unfold this intrinsic into sequence of explicit atomic loads and @@ -165,6 +166,7 @@ AMI->setNumElements(Constant::getNullValue(NumElementsCI->getType())); return AMI; } +*/ Instruction *InstCombiner::SimplifyMemTransfer(MemIntrinsic *MI) { unsigned DstAlign = getKnownAlignment(MI->getArgOperand(0), DL, MI, &AC, &DT); @@ -1892,6 +1894,7 @@ if (Changed) return II; } + /* -- temp removal to simplify staging if (auto *AMI = dyn_cast<ElementAtomicMemCpyInst>(II)) { if (Constant *C = dyn_cast<Constant>(AMI->getNumElements())) if (C->isNullValue()) @@ -1900,7 +1903,8 @@ if (Instruction *I = SimplifyElementAtomicMemCpy(AMI)) return I; } - + */ + if (Instruction *I = SimplifyNVVMIntrinsic(II, *this)) return I; Index: lib/Transforms/InstCombine/InstCombineInternal.h =================================================================== --- lib/Transforms/InstCombine/InstCombineInternal.h +++
lib/Transforms/InstCombine/InstCombineInternal.h @@ -687,7 +687,7 @@ Instruction *MatchBSwap(BinaryOperator &I); bool SimplifyStoreAtEndOfBlock(StoreInst &SI); - Instruction *SimplifyElementAtomicMemCpy(ElementAtomicMemCpyInst *AMI); + // Instruction *SimplifyElementAtomicMemCpy(ElementAtomicMemCpyInst *AMI); -- temp removal to aid staging Instruction *SimplifyMemTransfer(MemIntrinsic *MI); Instruction *SimplifyMemSet(MemSetInst *MI); Index: test/CodeGen/X86/element-wise-atomic-memory-intrinsics.ll =================================================================== --- test/CodeGen/X86/element-wise-atomic-memory-intrinsics.ll +++ test/CodeGen/X86/element-wise-atomic-memory-intrinsics.ll @@ -2,47 +2,67 @@ define i8* @test_memcpy1(i8* %P, i8* %Q) { ; CHECK: test_memcpy - call void @llvm.memcpy.element.atomic.p0i8.p0i8(i8* align 4 %P, i8* align 4 %Q, i64 1, i32 1) + call void @llvm.memcpy.element.unordered.atomic.p0i8.p0i8.i32(i8* align 4 %P, i8* align 4 %Q, i32 1, i32 4, i1 0, i8 3, i8 1) ret i8* %P + ; 3rd arg (%edx) -- size ; CHECK-DAG: movl $1, %edx - ; CHECK-DAG: movl $1, %ecx - ; CHECK: __llvm_memcpy_element_atomic_1 + ; 4th arg (%ecx) -- align + ; CHECK-DAG: movl $4, %ecx + ; 5th arg (%r8) -- isunordered + ; CHECK-DAG: movl $3, %r8d + ; CHECK: __llvm_memcpy_element_unordered_atomic_1 } define i8* @test_memcpy2(i8* %P, i8* %Q) { ; CHECK: test_memcpy2 - call void @llvm.memcpy.element.atomic.p0i8.p0i8(i8* align 4 %P, i8* align 4 %Q, i64 2, i32 2) + call void @llvm.memcpy.element.unordered.atomic.p0i8.p0i8.i32(i8* align 4 %P, i8* align 4 %Q, i32 2, i32 4, i1 0, i8 3, i8 2) ret i8* %P + ; 3rd arg (%edx) -- size ; CHECK-DAG: movl $2, %edx - ; CHECK-DAG: movl $2, %ecx - ; CHECK: __llvm_memcpy_element_atomic_2 + ; 4th arg (%ecx) -- align + ; CHECK-DAG: movl $4, %ecx + ; 5th arg (%r8) -- isunordered + ; CHECK-DAG: movl $3, %r8d + ; CHECK: __llvm_memcpy_element_unordered_atomic_2 } define i8* @test_memcpy4(i8* %P, i8* %Q) { ; CHECK: test_memcpy4 - call void @llvm.memcpy.element.atomic.p0i8.p0i8(i8* align 4 %P, i8* align 4 %Q, i64 4, i32 4) + call void @llvm.memcpy.element.unordered.atomic.p0i8.p0i8.i32(i8* align 4 %P, i8* align 4 %Q, i32 4, i32 4, i1 0, i8 3, i8 4) ret i8* %P + ; 3rd arg (%edx) -- size ; CHECK-DAG: movl $4, %edx + ; 4th arg (%ecx) -- align ; CHECK-DAG: movl $4, %ecx - ; CHECK: __llvm_memcpy_element_atomic_4 + ; 5th arg (%r8) -- isunordered + ; CHECK-DAG: movl $3, %r8d + ; CHECK: __llvm_memcpy_element_unordered_atomic_4 } define i8* @test_memcpy8(i8* %P, i8* %Q) { ; CHECK: test_memcpy8 - call void @llvm.memcpy.element.atomic.p0i8.p0i8(i8* align 8 %P, i8* align 8 %Q, i64 8, i32 8) + call void @llvm.memcpy.element.unordered.atomic.p0i8.p0i8.i32(i8* align 8 %P, i8* align 8 %Q, i32 8, i32 8, i1 0, i8 3, i8 8) ret i8* %P + ; 3rd arg (%edx) -- size ; CHECK-DAG: movl $8, %edx + ; 4th arg (%ecx) -- align ; CHECK-DAG: movl $8, %ecx - ; CHECK: __llvm_memcpy_element_atomic_8 + ; 5th arg (%r8) -- isunordered + ; CHECK-DAG: movl $3, %r8d + ; CHECK: __llvm_memcpy_element_unordered_atomic_8 } define i8* @test_memcpy16(i8* %P, i8* %Q) { ; CHECK: test_memcpy16 - call void @llvm.memcpy.element.atomic.p0i8.p0i8(i8* align 16 %P, i8* align 16 %Q, i64 16, i32 16) + call void @llvm.memcpy.element.unordered.atomic.p0i8.p0i8.i32(i8* align 16 %P, i8* align 16 %Q, i32 16, i32 16, i1 0, i8 3, i8 16) ret i8* %P + ; 3rd arg (%edx) -- size ; CHECK-DAG: movl $16, %edx + ; 4th arg (%ecx) -- align ; CHECK-DAG: movl $16, %ecx - ; CHECK: __llvm_memcpy_element_atomic_16 + ; 5th arg (%r8) -- isunordered + ; 
CHECK-DAG: movl $3, %r8d + ; CHECK: __llvm_memcpy_element_unordered_atomic_16 } define void @test_memcpy_args(i8** %Storage) { @@ -51,18 +71,19 @@ %Src.addr = getelementptr i8*, i8** %Storage, i64 1 %Src = load i8*, i8** %Src.addr - ; First argument + ; 1st arg (%rdi) ; CHECK-DAG: movq (%rdi), [[REG1:%r.+]] ; CHECK-DAG: movq [[REG1]], %rdi - ; Second argument + ; 2nd arg (%rsi) ; CHECK-DAG: movq 8(%rdi), %rsi - ; Third argument + ; 3rd arg (%edx) -- size ; CHECK-DAG: movl $4, %edx - ; Fourth argument + ; 4th arg (%ecx) -- align ; CHECK-DAG: movl $4, %ecx - ; CHECK: __llvm_memcpy_element_atomic_4 - call void @llvm.memcpy.element.atomic.p0i8.p0i8(i8* align 4 %Dst, i8* align 4 %Src, i64 4, i32 4) - ret void + ; 5th arg (%r8) -- isunordered + ; CHECK-DAG: movl $3, %r8d + ; CHECK: __llvm_memcpy_element_unordered_atomic_4 + call void @llvm.memcpy.element.unordered.atomic.p0i8.p0i8.i32(i8* align 4 %Dst, i8* align 4 %Src, i32 4, i32 4, i1 0, i8 3, i8 4) ret void } -declare void @llvm.memcpy.element.atomic.p0i8.p0i8(i8* nocapture, i8* nocapture, i64, i32) nounwind +declare void @llvm.memcpy.element.unordered.atomic.p0i8.p0i8.i32(i8* nocapture, i8* nocapture, i32, i32, i1, i8, i8) nounwind Index: test/Transforms/InstCombine/element-atomic-memcpy-to-loads.ll =================================================================== --- test/Transforms/InstCombine/element-atomic-memcpy-to-loads.ll +++ test/Transforms/InstCombine/element-atomic-memcpy-to-loads.ll @@ -1,4 +1,6 @@ ; RUN: opt -instcombine -unfold-element-atomic-memcpy-max-elements=8 -S < %s | FileCheck %s +; Temporarily an expected failure until inst combine is updated in the next patch +; XFAIL: * target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128" ; Test basic unfolding Index: test/Verifier/element-wise-atomic-memory-intrinsics.ll =================================================================== --- test/Verifier/element-wise-atomic-memory-intrinsics.ll +++ test/Verifier/element-wise-atomic-memory-intrinsics.ll @@ -1,17 +1,26 @@ ; RUN: not opt -verify < %s 2>&1 | FileCheck %s -define void @test_memcpy(i8* %P, i8* %Q) { - ; CHECK: element size of the element-wise atomic memory intrinsic must be a power of 2 - call void @llvm.memcpy.element.atomic.p0i8.p0i8(i8* align 2 %P, i8* align 2 %Q, i64 4, i32 3) +define void @test_memcpy(i8* %P, i8* %Q, i32 %A, i8 %E) { + ; CHECK: alignment argument of element-wise unordered atomic memory intrinsics must be a constant int + call void @llvm.memcpy.element.unordered.atomic.p0i8.p0i8.i32(i8* align 4 %P, i8* align 4 %Q, i32 1, i32 %A, i1 0, i8 3, i8 1) + + ; CHECK: alignment argument of element-wise unordered atomic memory intrinsics must be a power of 2 + call void @llvm.memcpy.element.unordered.atomic.p0i8.p0i8.i32(i8* align 4 %P, i8* align 4 %Q, i32 1, i32 5, i1 0, i8 3, i8 1) - ; CHECK: incorrect alignment of the destination argument - call void @llvm.memcpy.element.atomic.p0i8.p0i8(i8* align 2 %P, i8* align 4 %Q, i64 4, i32 4) + ; CHECK: isunordered of the element-wise unordered atomic memory intrinsic must be a constant int + call void @llvm.memcpy.element.unordered.atomic.p0i8.p0i8.i32(i8* align 4 %P, i8* align 4 %Q, i32 1, i32 4, i1 0, i8 %E, i8 1) - ; CHECK: incorrect alignment of the source argument - call void @llvm.memcpy.element.atomic.p0i8.p0i8(i8* align 4 %P, i8* align 2 %Q, i64 4, i32 4) + ; CHECK: isunordered of the element-wise unordered atomic memory intrinsic must be in the range [1,3] + call void @llvm.memcpy.element.unordered.atomic.p0i8.p0i8.i32(i8* align 4 %P, i8* align 4 %Q, 
i32 1, i32 4, i1 0, i8 0, i8 1) + ; CHECK: isunordered of the element-wise unordered atomic memory intrinsic must be in the range [1,3] + call void @llvm.memcpy.element.unordered.atomic.p0i8.p0i8.i32(i8* align 4 %P, i8* align 4 %Q, i32 1, i32 4, i1 0, i8 4, i8 1) + + ; CHECK: element size of the element-wise unordered atomic memory intrinsic must be a constant int + call void @llvm.memcpy.element.unordered.atomic.p0i8.p0i8.i32(i8* align 4 %P, i8* align 4 %Q, i32 1, i32 4, i1 0, i8 3, i8 %E) + ; CHECK: element size of the element-wise atomic memory intrinsic must be a power of 2 + call void @llvm.memcpy.element.unordered.atomic.p0i8.p0i8.i32(i8* align 4 %P, i8* align 4 %Q, i32 1, i32 4, i1 0, i8 3, i8 3) ret void } -declare void @llvm.memcpy.element.atomic.p0i8.p0i8(i8* nocapture, i8* nocapture, i64, i32) nounwind - +declare void @llvm.memcpy.element.unordered.atomic.p0i8.p0i8.i32(i8* nocapture, i8* nocapture, i32, i32, i1, i8, i8) nounwind ; CHECK: input module is broken!
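
Note (illustrative, not part of the patch): the lowering above emits calls to symbols such as ``__llvm_memcpy_element_unordered_atomic_4``, but no runtime implementation is included here. The following minimal C sketch of the 4-byte-element case is given only to clarify the intended semantics. The parameter list mirrors the argument list marshalled in SelectionDAGBuilder (dest, src, len, alignment, isunordered), with the element size encoded in the symbol name; the concrete C types are assumptions, and C11 ``memory_order_relaxed`` is used merely as an approximation of LLVM's ``unordered`` ordering. ::

  #include <stdatomic.h>
  #include <stdint.h>

  /* Hypothetical sketch of the element_size == 4 libcall; not the actual
   * runtime code.  isunordered == 1 makes stores to dest atomic,
   * == 2 makes loads from src atomic, == 3 makes both atomic. */
  void __llvm_memcpy_element_unordered_atomic_4(uint32_t *dest, const uint32_t *src,
                                                uint32_t len, uint32_t alignment,
                                                uint8_t isunordered) {
    (void)alignment; /* dest and src are already element-size aligned */
    /* len is guaranteed to be a positive multiple of the element size (4). */
    for (uint32_t i = 0; i != len / 4; ++i) {
      uint32_t v;
      if (isunordered & 2) /* loads from src are unordered atomic */
        v = atomic_load_explicit((const _Atomic uint32_t *)&src[i],
                                 memory_order_relaxed);
      else
        v = src[i];
      if (isunordered & 1) /* stores to dest are unordered atomic */
        atomic_store_explicit((_Atomic uint32_t *)&dest[i], v,
                              memory_order_relaxed);
      else
        dest[i] = v;
    }
  }

Reading ``isunordered`` as a bitmask (bit 0 = dest stores, bit 1 = src loads) is just a compact way to cover the three documented values 1, 2, and 3; the casts to ``_Atomic uint32_t *`` assume identical object representation and are for illustration only.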