diff --git a/llvm/docs/LangRef.rst b/llvm/docs/LangRef.rst
--- a/llvm/docs/LangRef.rst
+++ b/llvm/docs/LangRef.rst
@@ -15525,6 +15525,7 @@
 Syntax:
 """""""
+This is an overloaded intrinsic.
 
 ::
 
@@ -15549,17 +15550,20 @@
 -----------------
 
 Operations on matrixes requiring shape information (like number of rows/columns
-or the memory layout) can be expressed using the matrix intrinsics. Matrixes are
-embedded in a flat vector and the intrinsics take the dimensions as arguments.
-Currently column-major layout is assumed. The intrinsics support both integer
-and floating point matrixes.
+or the memory layout) can be expressed using the matrix intrinsics. These
+intrinsics require matrix dimensions to be passed as immediate arguments, and
+matrixes are passed and returned as vectors. This means that for an ``R`` x
+``C`` matrix, element ``i`` of column ``j`` is at index ``j * R + i`` in the
+corresponding vector, with indices starting at 0. Currently column-major layout
+is assumed. The intrinsics support both integer and floating point matrixes.
 
 '``llvm.matrix.transpose.*``' Intrinsic
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 Syntax:
 """""""
+This is an overloaded intrinsic.
 
 ::
 
@@ -15568,21 +15572,24 @@
 Overview:
 """""""""
 
-The '``llvm.matrix.transpose.*``' intrinsic treats %In as containing a matrix
-with <Rows> rows and <Cols> columns and returns the transposed matrix embedded in
-the result vector.
+The '``llvm.matrix.transpose.*``' intrinsics treat %In as a <Rows> x <Cols>
+matrix and return the transposed matrix in the result vector.
 
 Arguments:
 """"""""""
 
-The <Rows> and <Cols> arguments must be constant integers. The vector argument
-%In and the returned vector must have <Rows> * <Cols> elements.
+The first argument %In is a vector that corresponds to a <Rows> x <Cols> matrix.
+Thus, arguments <Rows> and <Cols> correspond to the number of rows and columns,
+respectively, and must be positive, constant integers. The returned vector must
+have <Rows> * <Cols> elements, and have the same float or integer element type
+as %In.
 
 '``llvm.matrix.multiply.*``' Intrinsic
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 Syntax:
 """""""
+This is an overloaded intrinsic.
 
 ::
 
@@ -15591,18 +15598,19 @@
 Overview:
 """""""""
 
-The '``llvm.matrix.multiply.*``' intrinsic treats %A as a matrix with <M>
-rows and <K> columns, %B as a matrix with <K> rows and <N>
-columns and multiplies them. The result matrix is returned embedded in the
-result vector.
+The '``llvm.matrix.multiply.*``' intrinsics treat %A as an <OuterRows> x <Inner>
+matrix, %B as an <Inner> x <OuterColumns> matrix, and multiply them. The result
+matrix is returned in the result vector.
 
 Arguments:
 """"""""""
 
-The <M>, <K> and <N> arguments must be constant
-integers. The vector argument %A must have <M> * <K> elements, %B
-must have <K> * <N> elements and the returned vector must have
-<M> * <N> elements.
+The first vector argument %A corresponds to a matrix with <OuterRows> * <Inner>
+elements, and the second argument %B to a matrix with <Inner> * <OuterColumns>
+elements. Arguments <OuterRows>, <Inner> and <OuterColumns> must be positive,
+constant integers. The returned vector must have <OuterRows> * <OuterColumns>
+elements. Vectors %A, %B, and the returned vector all have the same float or
+integer element type.
 
 
 '``llvm.matrix.column.major.load.*``' Intrinsic
@@ -15610,6 +15618,7 @@
 
 Syntax:
 """""""
+This is an overloaded intrinsic.
 
 ::
 
@@ -15619,22 +15628,26 @@
 Overview:
 """""""""
 
-The '``llvm.matrix.column.major.load.*``' intrinsic loads a matrix with <Rows>
-rows and <Cols> columns, using a stride of %Stride between columns. For two
-consecutive columns A and B, %Stride refers to the distance (the number of
-elements) between the start of column A and the start of column B. The result
-matrix is returned embedded in the result vector. This allows for convenient
-loading of sub matrixes. If <IsVolatile> is true, the intrinsic is considered
-a :ref:`volatile memory access <volatile>`.
-
-If the %Ptr argument is known to be aligned to some boundary, this can be
-specified as an attribute on the argument.
+The '``llvm.matrix.column.major.load.*``' intrinsics load a <Rows> x <Cols>
+matrix using a stride of %Stride to compute the start address of the different
+columns. This allows for convenient loading of sub matrixes. If <IsVolatile>
+is true, the intrinsic is considered a :ref:`volatile memory access
+<volatile>`. The result matrix is returned in the result vector. If the %Ptr
+argument is known to be aligned to some boundary, this can be specified as an
+attribute on the argument.
 
 Arguments:
 """"""""""
 
-The <Rows>, <Cols> and <IsVolatile> arguments must be constant integers. The
-returned vector must have <Rows> * <Cols> elements. %Stride must be >= <Rows>.
+The first argument %Ptr is a pointer to the element type of the returned
+vector, and corresponds to the start address to load from. The second argument
+%Stride is a positive, constant integer with %Stride ``>=`` <Rows>. %Stride is
+used to compute the column memory addresses. I.e., for a column ``C``, its
+start memory address is calculated with %Ptr + ``C`` * %Stride. The third
+argument <IsVolatile> is a boolean value. The fourth and fifth arguments,
+<Rows> and <Cols>, correspond to the number of rows and columns, respectively,
+and must be positive, constant integers. The returned vector must have
+<Rows> * <Cols> elements.
 
 The :ref:`align <attr_align>` parameter attribute can be provided for the %Ptr
 arguments.
 
@@ -15654,12 +15667,10 @@
 
 Overview:
 """""""""
 
-The '``llvm.matrix.column.major.store.*``' intrinsic stores the matrix with
-<Rows> rows and <Cols> columns embedded in %In, using a stride of %Stride
-between columns. For two consecutive columns A and B, %Stride refers to the
-distance (the number of elements) between the start of column A and the start
-of column B. If <IsVolatile> is true, the intrinsic is considered a
-:ref:`volatile memory access <volatile>`.
+The '``llvm.matrix.column.major.store.*``' intrinsics store the <Rows> x <Cols>
+matrix in %In to memory using a stride of %Stride between columns. If
+<IsVolatile> is true, the intrinsic is considered a :ref:`volatile memory
+access <volatile>`.
 
 If the %Ptr argument is known to be aligned to some boundary, this can be
 specified as an attribute on the argument.
 
@@ -15667,8 +15678,15 @@
 
 Arguments:
 """"""""""
 
-The <Rows>, <Cols>, <IsVolatile> arguments must be constant integers. The
-vector argument %In must have <Rows> * <Cols> elements. %Stride must be >= <Rows>.
+The first argument %In is a vector that corresponds to a <Rows> x <Cols> matrix
+to be stored to memory. The second argument %Ptr is a pointer to the element
+type of %In, and is the start address of the matrix in memory. The third
+argument %Stride is a positive, constant integer with %Stride ``>=`` <Rows>.
+%Stride is used to compute the column memory addresses. I.e., for a column
+``C``, its start memory address is calculated with %Ptr + ``C`` * %Stride.
+The fourth argument <IsVolatile> is a boolean value. The arguments <Rows> and
+<Cols> correspond to the number of rows and columns, respectively, and must be
+positive, constant integers.
 
 The :ref:`align <attr_align>` parameter attribute can be provided for the %Ptr
 arguments.
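Editor's note: to make the updated LangRef rules concrete, here is a small hypothetical
IR example (function and value names are illustrative and not part of the patch). A
3 x 2 double matrix travels as a flat <6 x double>, with element ``i`` of column ``j``
at index ``j * 3 + i``; the load and store intrinsics take a pointer to the element
type plus a stride that must be >= the number of rows:

::

      declare <6 x double> @llvm.matrix.column.major.load.v6f64(double*, i64, i1, i32, i32)
      declare <6 x double> @llvm.matrix.transpose.v6f64(<6 x double>, i32, i32)
      declare void @llvm.matrix.column.major.store.v6f64(<6 x double>, double*, i64, i1, i32, i32)

      define void @transpose_3x2(double* %in, double* %out) {
        ; Load a densely packed 3 x 2 matrix: stride 3 == number of rows.
        %m = call <6 x double> @llvm.matrix.column.major.load.v6f64(double* %in, i64 3, i1 false, i32 3, i32 2)
        ; Transpose it to a 2 x 3 matrix held in the same flat vector form.
        %t = call <6 x double> @llvm.matrix.transpose.v6f64(<6 x double> %m, i32 3, i32 2)
        ; Store the 2 x 3 result: the stride must now be >= 2.
        call void @llvm.matrix.column.major.store.v6f64(<6 x double> %t, double* %out, i64 2, i1 false, i32 2, i32 3)
        ret void
      }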
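Similarly, a hedged sketch of the multiply shape rules (again with illustrative names):
multiplying an <OuterRows> x <Inner> = 2 x 3 matrix by an <Inner> x <OuterColumns> =
3 x 2 matrix yields a 2 x 2 result, so %A and %B each carry 6 elements, the result
carries 4, and all three share the same element type, as the new verifier checks require:

::

      declare <4 x double> @llvm.matrix.multiply.v4f64.v6f64.v6f64(<6 x double>, <6 x double>, i32, i32, i32)

      define <4 x double> @multiply_2x3_3x2(<6 x double> %A, <6 x double> %B) {
        ; <OuterRows> = 2, <Inner> = 3, <OuterColumns> = 2;
        ; 2 * 3, 3 * 2 and 2 * 2 elements respectively.
        %C = call <4 x double> @llvm.matrix.multiply.v4f64.v6f64.v6f64(<6 x double> %A, <6 x double> %B, i32 2, i32 3, i32 2)
        ret <4 x double> %C
      }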
diff --git a/llvm/include/llvm/IR/Intrinsics.td b/llvm/include/llvm/IR/Intrinsics.td
--- a/llvm/include/llvm/IR/Intrinsics.td
+++ b/llvm/include/llvm/IR/Intrinsics.td
@@ -1458,7 +1458,7 @@
 def int_matrix_column_major_load
   : Intrinsic<[llvm_anyvector_ty],
-              [LLVMAnyPointerType<LLVMMatchType<0>>, llvm_i64_ty, llvm_i1_ty,
+              [LLVMPointerToElt<0>, llvm_i64_ty, llvm_i1_ty,
                llvm_i32_ty, llvm_i32_ty],
              [IntrNoSync, IntrWillReturn, IntrArgMemOnly, IntrReadMem,
               NoCapture<ArgIndex<0>>, ImmArg<ArgIndex<2>>, ImmArg<ArgIndex<3>>,
@@ -1466,7 +1466,7 @@
 def int_matrix_column_major_store
   : Intrinsic<[],
-              [llvm_anyvector_ty, LLVMAnyPointerType<LLVMMatchType<0>>,
+              [llvm_anyvector_ty, LLVMPointerToElt<0>,
                llvm_i64_ty, llvm_i1_ty, llvm_i32_ty, llvm_i32_ty],
              [IntrNoSync, IntrWillReturn, IntrArgMemOnly, IntrWriteMem,
               WriteOnly<ArgIndex<1>>, NoCapture<ArgIndex<1>>,
diff --git a/llvm/lib/IR/Verifier.cpp b/llvm/lib/IR/Verifier.cpp
--- a/llvm/lib/IR/Verifier.cpp
+++ b/llvm/lib/IR/Verifier.cpp
@@ -5017,36 +5017,73 @@
   case Intrinsic::matrix_transpose:
   case Intrinsic::matrix_column_major_load:
   case Intrinsic::matrix_column_major_store: {
+    Function *IF = Call.getCalledFunction();
+    ConstantInt *Stride = nullptr;
     ConstantInt *NumRows;
     ConstantInt *NumColumns;
-    VectorType *TypeToCheck;
+    VectorType *ResultTy;
+    Type *Op0ElemTy = nullptr;
+    Type *Op1ElemTy = nullptr;
     switch (ID) {
     case Intrinsic::matrix_multiply:
       NumRows = cast<ConstantInt>(Call.getArgOperand(2));
       NumColumns = cast<ConstantInt>(Call.getArgOperand(4));
-      TypeToCheck = cast<VectorType>(Call.getType());
+      ResultTy = cast<VectorType>(Call.getType());
+      Op0ElemTy =
+          cast<VectorType>(Call.getArgOperand(0)->getType())->getElementType();
+      Op1ElemTy =
+          cast<VectorType>(Call.getArgOperand(1)->getType())->getElementType();
       break;
     case Intrinsic::matrix_transpose:
       NumRows = cast<ConstantInt>(Call.getArgOperand(1));
       NumColumns = cast<ConstantInt>(Call.getArgOperand(2));
-      TypeToCheck = cast<VectorType>(Call.getType());
+      ResultTy = cast<VectorType>(Call.getType());
+      Op0ElemTy =
+          cast<VectorType>(Call.getArgOperand(0)->getType())->getElementType();
       break;
     case Intrinsic::matrix_column_major_load:
+      Stride = dyn_cast<ConstantInt>(Call.getArgOperand(1));
       NumRows = cast<ConstantInt>(Call.getArgOperand(3));
      NumColumns = cast<ConstantInt>(Call.getArgOperand(4));
-      TypeToCheck = cast<VectorType>(Call.getType());
+      ResultTy = cast<VectorType>(Call.getType());
+      Op0ElemTy =
+          cast<PointerType>(Call.getArgOperand(0)->getType())->getElementType();
       break;
     case Intrinsic::matrix_column_major_store:
+      Stride = dyn_cast<ConstantInt>(Call.getArgOperand(2));
       NumRows = cast<ConstantInt>(Call.getArgOperand(4));
       NumColumns = cast<ConstantInt>(Call.getArgOperand(5));
-      TypeToCheck = cast<VectorType>(Call.getArgOperand(0)->getType());
+      ResultTy = cast<VectorType>(Call.getArgOperand(0)->getType());
+      Op0ElemTy =
+          cast<VectorType>(Call.getArgOperand(0)->getType())->getElementType();
+      Op1ElemTy =
+          cast<PointerType>(Call.getArgOperand(1)->getType())->getElementType();
       break;
     default:
       llvm_unreachable("unexpected intrinsic");
     }
-    Assert(TypeToCheck->getNumElements() ==
+
+    Assert(ResultTy->getElementType()->isIntegerTy() ||
+               ResultTy->getElementType()->isFloatingPointTy(),
+           "Result type must be an integer or floating-point type!", IF);
+
+    Assert(ResultTy->getElementType() == Op0ElemTy,
+           "Vector element type mismatch of the result and first operand "
+           "vector!", IF);
+
+    if (Op1ElemTy)
+      Assert(ResultTy->getElementType() == Op1ElemTy,
+             "Vector element type mismatch of the result and second operand "
+             "vector!", IF);
+
+    Assert(ResultTy->getNumElements() ==
               NumRows->getZExtValue() * NumColumns->getZExtValue(),
-           "result of a matrix operation does not fit in the returned vector");
+           "Result of a matrix operation does not fit in the returned vector!");
+
+    if (Stride)
+      Assert(Stride->getZExtValue() >= NumRows->getZExtValue(),
+             "Stride must be greater or equal than the number of rows!", IF);
+
     break;
   }
   };
diff --git a/llvm/test/Transforms/LowerMatrixIntrinsics/load-align-volatile.ll b/llvm/test/Transforms/LowerMatrixIntrinsics/load-align-volatile.ll
--- a/llvm/test/Transforms/LowerMatrixIntrinsics/load-align-volatile.ll
+++ b/llvm/test/Transforms/LowerMatrixIntrinsics/load-align-volatile.ll
@@ -1,30 +1,29 @@
 ; RUN: opt -lower-matrix-intrinsics -S < %s | FileCheck %s
 ; RUN: opt -passes='lower-matrix-intrinsics' -S < %s | FileCheck %s
 
-define <9 x double> @strided_load_3x3_volatile(<9 x double>* %in, i64 %stride) {
+define <9 x double> @strided_load_3x3_volatile(double* %in, i64 %stride) {
 ; CHECK-LABEL: @strided_load_3x3_volatile(
 ; CHECK-NEXT: entry:
-; CHECK-NEXT: [[TMP0:%.*]] = bitcast <9 x double>* [[IN:%.*]] to double*
 ; CHECK-NEXT: [[VEC_START:%.*]] = mul i64 0, [[STRIDE:%.*]]
-; CHECK-NEXT: [[VEC_GEP:%.*]] = getelementptr double, double* [[TMP0]], i64 [[VEC_START]]
+; CHECK-NEXT: [[VEC_GEP:%.*]] = getelementptr double, double* %in, i64 [[VEC_START]]
 ; CHECK-NEXT: [[VEC_CAST:%.*]] = bitcast double* [[VEC_GEP]] to <3 x double>*
 ; CHECK-NEXT: load volatile <3 x double>, <3 x double>* [[VEC_CAST]], align 8
 ; CHECK-NEXT: [[VEC_START1:%.*]] = mul i64 1, [[STRIDE]]
-; CHECK-NEXT: [[VEC_GEP2:%.*]] = getelementptr double, double* [[TMP0]], i64 [[VEC_START1]]
+; CHECK-NEXT: [[VEC_GEP2:%.*]] = getelementptr double, double* %in, i64 [[VEC_START1]]
 ; CHECK-NEXT: [[VEC_CAST3:%.*]] = bitcast double* [[VEC_GEP2]] to <3 x double>*
 ; CHECK-NEXT: load volatile <3 x double>, <3 x double>* [[VEC_CAST3]], align 8
 ; CHECK-NEXT: [[VEC_START5:%.*]] = mul i64 2, [[STRIDE]]
-; CHECK-NEXT: [[VEC_GEP6:%.*]] = getelementptr double, double* [[TMP0]], i64 [[VEC_START5]]
+; CHECK-NEXT: [[VEC_GEP6:%.*]] = getelementptr double, double* %in, i64 [[VEC_START5]]
 ; CHECK-NEXT: [[VEC_CAST7:%.*]] = bitcast double* [[VEC_GEP6]] to <3 x double>*
 ; CHECK-NEXT: load volatile <3 x double>, <3 x double>* [[VEC_CAST7]], align 8
 ; CHECK-NOT: = load
 ;
 entry:
-  %load = call <9 x double> @llvm.matrix.column.major.load.v9f64(<9 x double>* %in, i64 %stride, i1 true, i32 3, i32 3)
+  %load = call <9 x double> @llvm.matrix.column.major.load.v9f64(double* %in, i64 %stride, i1 true, i32 3, i32 3)
   ret <9 x double> %load
 }
 
-declare <9 x double> @llvm.matrix.column.major.load.v9f64(<9 x double>*, i64, i1, i32, i32)
+declare <9 x double> @llvm.matrix.column.major.load.v9f64(double*, i64, i1, i32, i32)
 
 define <4 x double> @load_volatile_multiply(<4 x double>* %in) {
 ; CHECK-LABEL: @load_volatile_multiply(
@@ -44,49 +43,47 @@
 declare <4 x double> @llvm.matrix.multiply(<4 x double>, <4 x double>, i32, i32, i32)
 
-define <9 x double> @strided_load_3x3_align32(<9 x double>* %in, i64 %stride) {
+define <9 x double> @strided_load_3x3_align32(double* %in, i64 %stride) {
 ; CHECK-LABEL: @strided_load_3x3_align32(
 ; CHECK-NEXT: entry:
-; CHECK-NEXT: [[TMP0:%.*]] = bitcast <9 x double>* [[IN:%.*]] to double*
 ; CHECK-NEXT: [[VEC_START:%.*]] = mul i64 0, [[STRIDE:%.*]]
-; CHECK-NEXT: [[VEC_GEP:%.*]] = getelementptr double, double* [[TMP0]], i64 [[VEC_START]]
+; CHECK-NEXT: [[VEC_GEP:%.*]] = getelementptr double, double* %in, i64 [[VEC_START]]
 ; CHECK-NEXT: [[VEC_CAST:%.*]] = bitcast double* [[VEC_GEP]] to <3 x double>*
 ; CHECK-NEXT: load <3 x double>, <3 x double>* [[VEC_CAST]], align 32
 ; CHECK-NEXT: [[VEC_START1:%.*]] = mul i64 1, [[STRIDE]]
-; CHECK-NEXT: [[VEC_GEP2:%.*]] = getelementptr double, double* [[TMP0]], i64 [[VEC_START1]]
+; CHECK-NEXT: [[VEC_GEP2:%.*]] = getelementptr double, double* %in, i64 [[VEC_START1]]
; CHECK-NEXT: [[VEC_CAST3:%.*]] = bitcast double* [[VEC_GEP2]] to <3 x double>* ; CHECK-NEXT: load <3 x double>, <3 x double>* [[VEC_CAST3]], align 8 ; CHECK-NEXT: [[VEC_START5:%.*]] = mul i64 2, [[STRIDE]] -; CHECK-NEXT: [[VEC_GEP6:%.*]] = getelementptr double, double* [[TMP0]], i64 [[VEC_START5]] +; CHECK-NEXT: [[VEC_GEP6:%.*]] = getelementptr double, double* %in, i64 [[VEC_START5]] ; CHECK-NEXT: [[VEC_CAST7:%.*]] = bitcast double* [[VEC_GEP6]] to <3 x double>* ; CHECK-NEXT: load <3 x double>, <3 x double>* [[VEC_CAST7]], align 8 ; CHECK-NOT: = load ; entry: - %load = call <9 x double> @llvm.matrix.column.major.load.v9f64(<9 x double>* align 32 %in, i64 %stride, i1 false, i32 3, i32 3) + %load = call <9 x double> @llvm.matrix.column.major.load.v9f64(double* align 32 %in, i64 %stride, i1 false, i32 3, i32 3) ret <9 x double> %load } -define <9 x double> @strided_load_3x3_align2(<9 x double>* %in, i64 %stride) { +define <9 x double> @strided_load_3x3_align2(double* %in, i64 %stride) { ; CHECK-LABEL: @strided_load_3x3_align2( ; CHECK-NEXT: entry: -; CHECK-NEXT: [[TMP0:%.*]] = bitcast <9 x double>* [[IN:%.*]] to double* ; CHECK-NEXT: [[VEC_START:%.*]] = mul i64 0, [[STRIDE:%.*]] -; CHECK-NEXT: [[VEC_GEP:%.*]] = getelementptr double, double* [[TMP0]], i64 [[VEC_START]] +; CHECK-NEXT: [[VEC_GEP:%.*]] = getelementptr double, double* %in, i64 [[VEC_START]] ; CHECK-NEXT: [[VEC_CAST:%.*]] = bitcast double* [[VEC_GEP]] to <3 x double>* ; CHECK-NEXT: load <3 x double>, <3 x double>* [[VEC_CAST]], align 2 ; CHECK-NEXT: [[VEC_START1:%.*]] = mul i64 1, [[STRIDE]] -; CHECK-NEXT: [[VEC_GEP2:%.*]] = getelementptr double, double* [[TMP0]], i64 [[VEC_START1]] +; CHECK-NEXT: [[VEC_GEP2:%.*]] = getelementptr double, double* %in, i64 [[VEC_START1]] ; CHECK-NEXT: [[VEC_CAST3:%.*]] = bitcast double* [[VEC_GEP2]] to <3 x double>* ; CHECK-NEXT: load <3 x double>, <3 x double>* [[VEC_CAST3]], align 2 ; CHECK-NEXT: [[VEC_START5:%.*]] = mul i64 2, [[STRIDE]] -; CHECK-NEXT: [[VEC_GEP6:%.*]] = getelementptr double, double* [[TMP0]], i64 [[VEC_START5]] +; CHECK-NEXT: [[VEC_GEP6:%.*]] = getelementptr double, double* %in, i64 [[VEC_START5]] ; CHECK-NEXT: [[VEC_CAST7:%.*]] = bitcast double* [[VEC_GEP6]] to <3 x double>* ; CHECK-NEXT: load <3 x double>, <3 x double>* [[VEC_CAST7]], align 2 ; CHECK-NOT: = load ; entry: - %load = call <9 x double> @llvm.matrix.column.major.load.v9f64(<9 x double>* align 2 %in, i64 %stride, i1 false, i32 3, i32 3) + %load = call <9 x double> @llvm.matrix.column.major.load.v9f64(double* align 2 %in, i64 %stride, i1 false, i32 3, i32 3) ret <9 x double> %load } @@ -106,16 +103,15 @@ ret <4 x double> %res } -define <6 x float> @strided_load_2x3_align16_stride2(<6 x float>* %in) { +define <6 x float> @strided_load_2x3_align16_stride2(float* %in) { ; CHECK-LABEL: @strided_load_2x3_align16_stride2( ; CHECK-NEXT: entry: -; CHECK-NEXT: [[TMP0:%.*]] = bitcast <6 x float>* [[IN:%.*]] to float* -; CHECK-NEXT: [[VEC_CAST:%.*]] = bitcast float* [[TMP0]] to <2 x float>* +; CHECK-NEXT: [[VEC_CAST:%.*]] = bitcast float* %in to <2 x float>* ; CHECK-NEXT: [[COL_LOAD:%.*]] = load <2 x float>, <2 x float>* [[VEC_CAST]], align 16 -; CHECK-NEXT: [[VEC_GEP:%.*]] = getelementptr float, float* [[TMP0]], i64 2 +; CHECK-NEXT: [[VEC_GEP:%.*]] = getelementptr float, float* %in, i64 2 ; CHECK-NEXT: [[VEC_CAST1:%.*]] = bitcast float* [[VEC_GEP]] to <2 x float>* ; CHECK-NEXT: [[COL_LOAD2:%.*]] = load <2 x float>, <2 x float>* [[VEC_CAST1]], align 8 -; CHECK-NEXT: [[VEC_GEP3:%.*]] = getelementptr float, float* [[TMP0]], i64 4 
+; CHECK-NEXT: [[VEC_GEP3:%.*]] = getelementptr float, float* %in, i64 4
 ; CHECK-NEXT: [[VEC_CAST4:%.*]] = bitcast float* [[VEC_GEP3]] to <2 x float>*
 ; CHECK-NEXT: [[COL_LOAD5:%.*]] = load <2 x float>, <2 x float>* [[VEC_CAST4]], align 16
 ; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <2 x float> [[COL_LOAD]], <2 x float> [[COL_LOAD2]], <4 x i32> <i32 0, i32 1, i32 2, i32 3>
@@ -124,8 +120,8 @@
 ; CHECK-NEXT: ret <6 x float> [[TMP3]]
 ;
 entry:
-  %load = call <6 x float> @llvm.matrix.column.major.load.v6f32(<6 x float>* align 16 %in, i64 2, i1 false, i32 2, i32 3)
+  %load = call <6 x float> @llvm.matrix.column.major.load.v6f32(float* align 16 %in, i64 2, i1 false, i32 2, i32 3)
   ret <6 x float> %load
 }
 
-declare <6 x float> @llvm.matrix.column.major.load.v6f32(<6 x float>*, i64, i1, i32, i32)
+declare <6 x float> @llvm.matrix.column.major.load.v6f32(float*, i64, i1, i32, i32)
diff --git a/llvm/test/Transforms/LowerMatrixIntrinsics/remarks-inlining.ll b/llvm/test/Transforms/LowerMatrixIntrinsics/remarks-inlining.ll
--- a/llvm/test/Transforms/LowerMatrixIntrinsics/remarks-inlining.ll
+++ b/llvm/test/Transforms/LowerMatrixIntrinsics/remarks-inlining.ll
@@ -92,10 +92,10 @@
 ; CHECK-LABEL: remark: transpose.h:13:11: Lowered with 0 stores, 0 loads, 8 compute ops
 ; CHECK-NEXT: transpose.1x2.float(transpose.2x1.float(addr %D))
 
-define void @toplevel(<15 x double>* %A, <15 x double>* %B, <15 x double>* %C, <2 x float>* %D) !dbg !16 {
+define void @toplevel(<15 x double>* %A, double* %B, <15 x double>* %C, <2 x float>* %D) !dbg !16 {
 entry:
   %a = load <15 x double>, <15 x double> *%A, align 16, !dbg !3791
-  %b = call <15 x double> @llvm.matrix.column.major.load(<15 x double>* %B, i64 5, i1 false, i32 3, i32 5), !dbg !3793
+  %b = call <15 x double> @llvm.matrix.column.major.load(double* %B, i64 5, i1 false, i32 3, i32 5), !dbg !3793
   %c = fadd <15 x double> %a, %b, !dbg !100
 
   store <15 x double> %c, <15 x double> *%C, align 16, !dbg !102
@@ -106,7 +106,7 @@
   ret void
 }
 
-declare <15 x double> @llvm.matrix.column.major.load(<15 x double>*, i64, i1, i32, i32)
+declare <15 x double> @llvm.matrix.column.major.load(double*, i64, i1, i32, i32)
 declare <2 x float> @llvm.matrix.transpose(<2 x float>, i32, i32)
 
 !llvm.dbg.cu = !{!0}
diff --git a/llvm/test/Transforms/LowerMatrixIntrinsics/remarks.ll b/llvm/test/Transforms/LowerMatrixIntrinsics/remarks.ll
--- a/llvm/test/Transforms/LowerMatrixIntrinsics/remarks.ll
+++ b/llvm/test/Transforms/LowerMatrixIntrinsics/remarks.ll
@@ -15,9 +15,6 @@
   ret void
 }
 
-declare <12 x double> @llvm.matrix.transpose.v12f64.v12f64(<12 x double>, i32, i32)
-
-
 ; CHECK-LABEL: remark: test.h:50:20: Lowered with 2 stores, 12 loads, 22 compute ops
 ; CHECK-NEXT: store(
 ; CHECK-NEXT: multiply.2x6.6x2.double(
@@ -32,33 +29,27 @@
   ret void
 }
 
-declare <4 x double> @llvm.matrix.multiply(<12 x double>, <12 x double>, i32, i32, i32)
-
 ; CHECK-LABEL: remark: test.h:60:20: Lowered with 6 stores, 6 loads, 0 compute ops
 ; CHECK-NEXT: store(
 ; CHECK-NEXT: column.major.load.3x3.double(addr %A, 5),
 ; CHECK-NEXT: addr %B)
-define void @column.major.load(<9 x double>* %A, <9 x double>* %B) !dbg !27 {
-  %A.matrix = call <9 x double> @llvm.matrix.column.major.load(<9 x double>* %A, i64 5, i1 false, i32 3, i32 3), !dbg !28
+define void @column.major.load(double* %A, <9 x double>* %B) !dbg !27 {
+  %A.matrix = call <9 x double> @llvm.matrix.column.major.load(double* %A, i64 5, i1 false, i32 3, i32 3), !dbg !28
   store <9 x double> %A.matrix, <9 x double>* %B, !dbg !28
   ret void
 }
 
-declare <9 x double> @llvm.matrix.column.major.load(<9 x double>*, i64, i1, i32, i32)
-
 ; CHECK-LABEL: remark: test.h:70:20: Lowered with 6 stores, 6 loads, 0 compute ops
 ; CHECK-NEXT: column.major.store.3x3.double(
 ; CHECK-NEXT: column.major.load.3x3.double(addr %A, 5),
 ; CHECK-NEXT: addr %B,
 ; CHECK-NEXT: 10)
-define void @column.major.store(<9 x double>* %A, <9 x double>* %B) !dbg !29 {
-  %A.matrix = call <9 x double> @llvm.matrix.column.major.load(<9 x double>* %A, i64 5, i1 false, i32 3, i32 3), !dbg !30
-  call void @llvm.matrix.column.major.store(<9 x double> %A.matrix, <9 x double>* %B, i64 10, i1 false, i32 3, i32 3), !dbg !30
+define void @column.major.store(double* %A, double* %B) !dbg !29 {
+  %A.matrix = call <9 x double> @llvm.matrix.column.major.load(double* %A, i64 5, i1 false, i32 3, i32 3), !dbg !30
+  call void @llvm.matrix.column.major.store(<9 x double> %A.matrix, double* %B, i64 10, i1 false, i32 3, i32 3), !dbg !30
   ret void
 }
 
-declare void @llvm.matrix.column.major.store(<9 x double>, <9 x double>*, i64, i1, i32, i32)
-
 ; CHECK-LABEL: remark: test.h:80:20: Lowered with 6 stores, 6 loads, 12 compute ops
 ; CHECK-NEXT: column.major.store.3x3.double(
 ; CHECK-NEXT: fmul(
@@ -69,11 +60,11 @@
 ; CHECK-NEXT: addr %B,
 ; CHECK-NEXT: 10)
 
-define void @binaryops(<9 x double>* %A, <9 x double>* %B) !dbg !31 {
-  %A.matrix = call <9 x double> @llvm.matrix.column.major.load(<9 x double>* %A, i64 5, i1 false, i32 3, i32 3), !dbg !32
+define void @binaryops(double* %A, double* %B) !dbg !31 {
+  %A.matrix = call <9 x double> @llvm.matrix.column.major.load(double* %A, i64 5, i1 false, i32 3, i32 3), !dbg !32
   %R1.matrix = fadd <9 x double> %A.matrix, %A.matrix, !dbg !32
   %R2.matrix = fmul <9 x double> %R1.matrix, %A.matrix, !dbg !32
-  call void @llvm.matrix.column.major.store(<9 x double> %R2.matrix, <9 x double>* %B, i64 10, i1 false, i32 3, i32 3), !dbg !32
+  call void @llvm.matrix.column.major.store(<9 x double> %R2.matrix, double* %B, i64 10, i1 false, i32 3, i32 3), !dbg !32
   ret void
 }
 
@@ -93,11 +84,11 @@
 ; CHECK-NEXT: load(addr %D)),
 ; CHECK-NEXT: addr %E)
 
-define void @multiple_expressions(<9 x double>* %A, <9 x double>* %B, <12 x double>* %C, <12 x double>* %D, <4 x double>* %E) !dbg !33 {
-  %A.matrix = call <9 x double> @llvm.matrix.column.major.load(<9 x double>* %A, i64 5, i1 false, i32 3, i32 3), !dbg !34
+define void @multiple_expressions(double* %A, double* %B, <12 x double>* %C, <12 x double>* %D, <4 x double>* %E) !dbg !33 {
+  %A.matrix = call <9 x double> @llvm.matrix.column.major.load(double* %A, i64 5, i1 false, i32 3, i32 3), !dbg !34
   %R1.matrix = fadd <9 x double> %A.matrix, %A.matrix, !dbg !34
   %R2.matrix = fmul <9 x double> %R1.matrix, %A.matrix, !dbg !34
-  call void @llvm.matrix.column.major.store(<9 x double> %R2.matrix, <9 x double>* %B, i64 10, i1 false, i32 3, i32 3), !dbg !34
+  call void @llvm.matrix.column.major.store(<9 x double> %R2.matrix, double* %B, i64 10, i1 false, i32 3, i32 3), !dbg !34
 
   %C.matrix = load <12 x double>, <12 x double>* %C, !dbg !34
   %D.matrix = load <12 x double>, <12 x double>* %D, !dbg !34
@@ -114,14 +105,13 @@
 ; CHECK-NEXT: column.major.load.3x3.double(addr %A, 5)
 ; CHECK-NEXT: (reused) column.major.load.3x3.double(addr %A, 5)),
 ; CHECK-NEXT: (reused) column.major.load.3x3.double(addr %A, 5)),
-; CHECK-NEXT: stack addr %B,
+; CHECK-NEXT: addr %B,
 ; CHECK-NEXT: 10)
-define void @stackaddresses(<9 x double>* %A) !dbg !35 {
-  %B = alloca <9 x double>
-  %A.matrix = call <9 x double> @llvm.matrix.column.major.load(<9 x double>* %A, i64 5, i1 false, i32 3, i32 3), !dbg !36
+define void @stackaddresses(double* %A, double* %B) !dbg !35 {
+  %A.matrix = call <9 x double> @llvm.matrix.column.major.load(double* %A, i64 5, i1 false, i32 3, i32 3), !dbg !36
   %R1.matrix = fadd <9 x double> %A.matrix, %A.matrix, !dbg !36
   %R2.matrix = fmul <9 x double> %R1.matrix, %A.matrix, !dbg !36
-  call void @llvm.matrix.column.major.store(<9 x double> %R2.matrix, <9 x double>* %B, i64 10, i1 false, i32 3, i32 3), !dbg !36
+  call void @llvm.matrix.column.major.store(<9 x double> %R2.matrix, double* %B, i64 10, i1 false, i32 3, i32 3), !dbg !36
   ret void
 }
 
@@ -146,7 +136,12 @@
   ret void
 }
 
+declare <12 x double> @llvm.matrix.transpose.v12f64.v12f64(<12 x double>, i32, i32)
+declare <4 x double> @llvm.matrix.multiply(<12 x double>, <12 x double>, i32, i32, i32)
+declare <9 x double> @llvm.matrix.column.major.load(double*, i64, i1, i32, i32)
 declare <15 x double> @llvm.matrix.transpose.v15f64.v15f64(<15 x double>, i32, i32)
+declare void @llvm.matrix.column.major.store(<9 x double>, double*, i64, i1, i32, i32)
+
 
 !llvm.dbg.cu = !{!0}
 !llvm.module.flags = !{!3, !4}
diff --git a/llvm/test/Transforms/LowerMatrixIntrinsics/strided-load-double.ll b/llvm/test/Transforms/LowerMatrixIntrinsics/strided-load-double.ll
--- a/llvm/test/Transforms/LowerMatrixIntrinsics/strided-load-double.ll
+++ b/llvm/test/Transforms/LowerMatrixIntrinsics/strided-load-double.ll
@@ -2,20 +2,19 @@
 ; RUN: opt -lower-matrix-intrinsics -S < %s | FileCheck %s
 ; RUN: opt -passes='lower-matrix-intrinsics' -S < %s | FileCheck %s
 
-define <9 x double> @strided_load_3x3(<9 x double>* %in, i64 %stride) {
+define <9 x double> @strided_load_3x3(double* %in, i64 %stride) {
 ; CHECK-LABEL: @strided_load_3x3(
 ; CHECK-NEXT: entry:
-; CHECK-NEXT: [[TMP0:%.*]] = bitcast <9 x double>* [[IN:%.*]] to double*
 ; CHECK-NEXT: [[VEC_START:%.*]] = mul i64 0, [[STRIDE:%.*]]
-; CHECK-NEXT: [[VEC_GEP:%.*]] = getelementptr double, double* [[TMP0]], i64 [[VEC_START]]
+; CHECK-NEXT: [[VEC_GEP:%.*]] = getelementptr double, double* %in, i64 [[VEC_START]]
 ; CHECK-NEXT: [[VEC_CAST:%.*]] = bitcast double* [[VEC_GEP]] to <3 x double>*
 ; CHECK-NEXT: [[COL_LOAD:%.*]] = load <3 x double>, <3 x double>* [[VEC_CAST]], align 8
 ; CHECK-NEXT: [[VEC_START1:%.*]] = mul i64 1, [[STRIDE]]
-; CHECK-NEXT: [[VEC_GEP2:%.*]] = getelementptr double, double* [[TMP0]], i64 [[VEC_START1]]
+; CHECK-NEXT: [[VEC_GEP2:%.*]] = getelementptr double, double* %in, i64 [[VEC_START1]]
 ; CHECK-NEXT: [[VEC_CAST3:%.*]] = bitcast double* [[VEC_GEP2]] to <3 x double>*
 ; CHECK-NEXT: [[COL_LOAD4:%.*]] = load <3 x double>, <3 x double>* [[VEC_CAST3]], align 8
 ; CHECK-NEXT: [[VEC_START5:%.*]] = mul i64 2, [[STRIDE]]
-; CHECK-NEXT: [[VEC_GEP6:%.*]] = getelementptr double, double* [[TMP0]], i64 [[VEC_START5]]
+; CHECK-NEXT: [[VEC_GEP6:%.*]] = getelementptr double, double* %in, i64 [[VEC_START5]]
 ; CHECK-NEXT: [[VEC_CAST7:%.*]] = bitcast double* [[VEC_GEP6]] to <3 x double>*
 ; CHECK-NEXT: [[COL_LOAD8:%.*]] = load <3 x double>, <3 x double>* [[VEC_CAST7]], align 8
 ; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <3 x double> [[COL_LOAD]], <3 x double> [[COL_LOAD4]], <6 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5>
@@ -24,51 +23,47 @@
 ; CHECK-NEXT: ret <9 x double> [[TMP3]]
 ;
 entry:
-  %load = call <9 x double> @llvm.matrix.column.major.load(<9 x double>* %in, i64 %stride, i1 false, i32 3, i32 3)
+  %load = call <9 x double> @llvm.matrix.column.major.load(double* %in, i64 %stride, i1 false, i32 3, i32 3)
   ret <9 x double> %load
 }
 
-declare <9 x double> @llvm.matrix.column.major.load(<9 x double>*, i64, i1, i32, i32)
+declare <9 x double> @llvm.matrix.column.major.load(double*, i64, i1, i32, i32)
-define <9 x double> @strided_load_9x1(<9 x double>* %in, i64 %stride) {
+define <9 x double> @strided_load_9x1(double* %in, i64 %stride) {
 ; CHECK-LABEL: @strided_load_9x1(
 ; CHECK-NEXT: entry:
-; CHECK-NEXT: [[TMP0:%.*]] = bitcast <9 x double>* [[IN:%.*]] to double*
 ; CHECK-NEXT: [[VEC_START:%.*]] = mul i64 0, [[STRIDE:%.*]]
-; CHECK-NEXT: [[VEC_GEP:%.*]] = getelementptr double, double* [[TMP0]], i64 [[VEC_START]]
+; CHECK-NEXT: [[VEC_GEP:%.*]] = getelementptr double, double* %in, i64 [[VEC_START]]
 ; CHECK-NEXT: [[VEC_CAST:%.*]] = bitcast double* [[VEC_GEP]] to <9 x double>*
 ; CHECK-NEXT: [[COL_LOAD:%.*]] = load <9 x double>, <9 x double>* [[VEC_CAST]], align 8
 ; CHECK-NEXT: ret <9 x double> [[COL_LOAD]]
 ;
 entry:
-  %load = call <9 x double> @llvm.matrix.column.major.load(<9 x double>* %in, i64 %stride, i1 false, i32 9, i32 1)
+  %load = call <9 x double> @llvm.matrix.column.major.load(double* %in, i64 %stride, i1 false, i32 9, i32 1)
   ret <9 x double> %load
 }
 
-declare <8 x double> @llvm.matrix.column.major.load.v8f64(<8 x double>*, i64, i1, i32, i32)
+declare <8 x double> @llvm.matrix.column.major.load.v8f64(double*, i64, i1, i32, i32)
+; CHECK: declare <8 x double> @llvm.matrix.column.major.load.v8f64(double* nocapture, i64, i1 immarg, i32 immarg, i32 immarg) [[READONLY:#[0-9]]]
 
-define <8 x double> @strided_load_4x2(<8 x double>* %in, i64 %stride) {
+define <8 x double> @strided_load_4x2(double* %in, i64 %stride) {
 ; CHECK-LABEL: @strided_load_4x2(
 ; CHECK-NEXT: entry:
-; CHECK-NEXT: [[TMP0:%.*]] = bitcast <8 x double>* [[IN:%.*]] to double*
 ; CHECK-NEXT: [[VEC_START:%.*]] = mul i64 0, [[STRIDE:%.*]]
-; CHECK-NEXT: [[VEC_GEP:%.*]] = getelementptr double, double* [[TMP0]], i64 [[VEC_START]]
+; CHECK-NEXT: [[VEC_GEP:%.*]] = getelementptr double, double* %in, i64 [[VEC_START]]
 ; CHECK-NEXT: [[VEC_CAST:%.*]] = bitcast double* [[VEC_GEP]] to <4 x double>*
 ; CHECK-NEXT: [[COL_LOAD:%.*]] = load <4 x double>, <4 x double>* [[VEC_CAST]], align 8
 ; CHECK-NEXT: [[VEC_START1:%.*]] = mul i64 1, [[STRIDE]]
-; CHECK-NEXT: [[VEC_GEP2:%.*]] = getelementptr double, double* [[TMP0]], i64 [[VEC_START1]]
+; CHECK-NEXT: [[VEC_GEP2:%.*]] = getelementptr double, double* %in, i64 [[VEC_START1]]
 ; CHECK-NEXT: [[VEC_CAST3:%.*]] = bitcast double* [[VEC_GEP2]] to <4 x double>*
 ; CHECK-NEXT: [[COL_LOAD4:%.*]] = load <4 x double>, <4 x double>* [[VEC_CAST3]], align 8
 ; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <4 x double> [[COL_LOAD]], <4 x double> [[COL_LOAD4]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
 ; CHECK-NEXT: ret <8 x double> [[TMP1]]
 ;
 entry:
-  %load = call <8 x double> @llvm.matrix.column.major.load.v8f64(<8 x double>* %in, i64 %stride, i1 false, i32 4, i32 2)
+  %load = call <8 x double> @llvm.matrix.column.major.load.v8f64(double* %in, i64 %stride, i1 false, i32 4, i32 2)
   ret <8 x double> %load
 }
 
-; CHECK: declare <9 x double> @llvm.matrix.column.major.load.v9f64.p0v9f64(<9 x double>* nocapture, i64, i1 immarg, i32 immarg, i32 immarg) [[READONLY:#[0-9]]]
-
-; CHECK: declare <8 x double> @llvm.matrix.column.major.load.v8f64.p0v8f64(<8 x double>* nocapture, i64, i1 immarg, i32 immarg, i32 immarg) [[READONLY]]
-
+; CHECK: declare <9 x double> @llvm.matrix.column.major.load.v9f64(double* nocapture, i64, i1 immarg, i32 immarg, i32 immarg) [[READONLY]]
 ; CHECK: attributes [[READONLY]] = { argmemonly nosync nounwind readonly willreturn }
diff --git a/llvm/test/Transforms/LowerMatrixIntrinsics/strided-load-float.ll b/llvm/test/Transforms/LowerMatrixIntrinsics/strided-load-float.ll
--- a/llvm/test/Transforms/LowerMatrixIntrinsics/strided-load-float.ll
+++ b/llvm/test/Transforms/LowerMatrixIntrinsics/strided-load-float.ll
@@ -2,20 +2,19 @@
 ; RUN: opt -lower-matrix-intrinsics -S < %s | FileCheck %s
 ; RUN: opt -passes='lower-matrix-intrinsics' -S < %s | FileCheck %s
 
-define <9 x float> @strided_load_3x3(<9 x float>* %in, i64 %stride) {
+define <9 x float> @strided_load_3x3(float* %in, i64 %stride) {
 ; CHECK-LABEL: @strided_load_3x3(
 ; CHECK-NEXT: entry:
-; CHECK-NEXT: [[TMP0:%.*]] = bitcast <9 x float>* [[IN:%.*]] to float*
 ; CHECK-NEXT: [[VEC_START:%.*]] = mul i64 0, [[STRIDE:%.*]]
-; CHECK-NEXT: [[VEC_GEP:%.*]] = getelementptr float, float* [[TMP0]], i64 [[VEC_START]]
+; CHECK-NEXT: [[VEC_GEP:%.*]] = getelementptr float, float* %in, i64 [[VEC_START]]
 ; CHECK-NEXT: [[VEC_CAST:%.*]] = bitcast float* [[VEC_GEP]] to <3 x float>*
 ; CHECK-NEXT: [[COL_LOAD:%.*]] = load <3 x float>, <3 x float>* [[VEC_CAST]], align 4
 ; CHECK-NEXT: [[VEC_START1:%.*]] = mul i64 1, [[STRIDE]]
-; CHECK-NEXT: [[VEC_GEP2:%.*]] = getelementptr float, float* [[TMP0]], i64 [[VEC_START1]]
+; CHECK-NEXT: [[VEC_GEP2:%.*]] = getelementptr float, float* %in, i64 [[VEC_START1]]
 ; CHECK-NEXT: [[VEC_CAST3:%.*]] = bitcast float* [[VEC_GEP2]] to <3 x float>*
 ; CHECK-NEXT: [[COL_LOAD4:%.*]] = load <3 x float>, <3 x float>* [[VEC_CAST3]], align 4
 ; CHECK-NEXT: [[VEC_START5:%.*]] = mul i64 2, [[STRIDE]]
-; CHECK-NEXT: [[VEC_GEP6:%.*]] = getelementptr float, float* [[TMP0]], i64 [[VEC_START5]]
+; CHECK-NEXT: [[VEC_GEP6:%.*]] = getelementptr float, float* %in, i64 [[VEC_START5]]
 ; CHECK-NEXT: [[VEC_CAST7:%.*]] = bitcast float* [[VEC_GEP6]] to <3 x float>*
 ; CHECK-NEXT: [[COL_LOAD8:%.*]] = load <3 x float>, <3 x float>* [[VEC_CAST7]], align 4
 ; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <3 x float> [[COL_LOAD]], <3 x float> [[COL_LOAD4]], <6 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5>
@@ -24,45 +23,43 @@
 ; CHECK-NEXT: ret <9 x float> [[TMP3]]
 ;
 entry:
-  %load = call <9 x float> @llvm.matrix.column.major.load(<9 x float>* %in, i64 %stride, i1 false, i32 3, i32 3)
+  %load = call <9 x float> @llvm.matrix.column.major.load(float* %in, i64 %stride, i1 false, i32 3, i32 3)
   ret <9 x float> %load
 }
 
-declare <9 x float> @llvm.matrix.column.major.load(<9 x float>*, i64, i1, i32, i32)
+declare <9 x float> @llvm.matrix.column.major.load(float*, i64, i1, i32, i32)
 
-define <9 x float> @strided_load_9x1(<9 x float>* %in, i64 %stride) {
+define <9 x float> @strided_load_9x1(float* %in, i64 %stride) {
 ; CHECK-LABEL: @strided_load_9x1(
 ; CHECK-NEXT: entry:
-; CHECK-NEXT: [[TMP0:%.*]] = bitcast <9 x float>* [[IN:%.*]] to float*
 ; CHECK-NEXT: [[VEC_START:%.*]] = mul i64 0, [[STRIDE:%.*]]
-; CHECK-NEXT: [[VEC_GEP:%.*]] = getelementptr float, float* [[TMP0]], i64 [[VEC_START]]
+; CHECK-NEXT: [[VEC_GEP:%.*]] = getelementptr float, float* %in, i64 [[VEC_START]]
 ; CHECK-NEXT: [[VEC_CAST:%.*]] = bitcast float* [[VEC_GEP]] to <9 x float>*
 ; CHECK-NEXT: [[COL_LOAD:%.*]] = load <9 x float>, <9 x float>* [[VEC_CAST]], align 4
 ; CHECK-NEXT: ret <9 x float> [[COL_LOAD]]
 ;
 entry:
-  %load = call <9 x float> @llvm.matrix.column.major.load(<9 x float>* %in, i64 %stride, i1 false, i32 9, i32 1)
+  %load = call <9 x float> @llvm.matrix.column.major.load(float* %in, i64 %stride, i1 false, i32 9, i32 1)
   ret <9 x float> %load
 }
 
-declare <8 x float> @llvm.matrix.column.major.load.v8f32(<8 x float>*, i64, i1, i32, i32)
+declare <8 x float> @llvm.matrix.column.major.load.v8f32(float*, i64, i1, i32, i32)
 
-define <8 x float> @strided_load_4x2(<8 x float>* %in, i64 %stride) {
+define <8 x float> @strided_load_4x2(float* %in, i64 %stride) {
 ; CHECK-LABEL: @strided_load_4x2(
 ; CHECK-NEXT: entry:
-; CHECK-NEXT: [[TMP0:%.*]] = bitcast <8 x float>* [[IN:%.*]] to float*
 ; CHECK-NEXT: [[VEC_START:%.*]] = mul i64 0, [[STRIDE:%.*]]
-; CHECK-NEXT: [[VEC_GEP:%.*]] = getelementptr float, float* [[TMP0]], i64 [[VEC_START]]
+; CHECK-NEXT: [[VEC_GEP:%.*]] = getelementptr float, float* %in, i64 [[VEC_START]]
 ; CHECK-NEXT: [[VEC_CAST:%.*]] = bitcast float* [[VEC_GEP]] to <4 x float>*
 ; CHECK-NEXT: [[COL_LOAD:%.*]] = load <4 x float>, <4 x float>* [[VEC_CAST]], align 4
 ; CHECK-NEXT: [[VEC_START1:%.*]] = mul i64 1, [[STRIDE]]
-; CHECK-NEXT: [[VEC_GEP2:%.*]] = getelementptr float, float* [[TMP0]], i64 [[VEC_START1]]
+; CHECK-NEXT: [[VEC_GEP2:%.*]] = getelementptr float, float* %in, i64 [[VEC_START1]]
 ; CHECK-NEXT: [[VEC_CAST3:%.*]] = bitcast float* [[VEC_GEP2]] to <4 x float>*
 ; CHECK-NEXT: [[COL_LOAD4:%.*]] = load <4 x float>, <4 x float>* [[VEC_CAST3]], align 4
 ; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <4 x float> [[COL_LOAD]], <4 x float> [[COL_LOAD4]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
 ; CHECK-NEXT: ret <8 x float> [[TMP1]]
 ;
 entry:
-  %load = call <8 x float> @llvm.matrix.column.major.load.v8f32(<8 x float>* %in, i64 %stride, i1 false, i32 4, i32 2)
+  %load = call <8 x float> @llvm.matrix.column.major.load.v8f32(float* %in, i64 %stride, i1 false, i32 4, i32 2)
   ret <8 x float> %load
 }
diff --git a/llvm/test/Transforms/LowerMatrixIntrinsics/strided-load-i32.ll b/llvm/test/Transforms/LowerMatrixIntrinsics/strided-load-i32.ll
--- a/llvm/test/Transforms/LowerMatrixIntrinsics/strided-load-i32.ll
+++ b/llvm/test/Transforms/LowerMatrixIntrinsics/strided-load-i32.ll
@@ -2,20 +2,19 @@
 ; RUN: opt -lower-matrix-intrinsics -S < %s | FileCheck %s
 ; RUN: opt -passes='lower-matrix-intrinsics' -S < %s | FileCheck %s
 
-define <9 x i32> @strided_load_3x3(<9 x i32>* %in, i64 %stride) {
+define <9 x i32> @strided_load_3x3(i32* %in, i64 %stride) {
 ; CHECK-LABEL: @strided_load_3x3(
 ; CHECK-NEXT: entry:
-; CHECK-NEXT: [[TMP0:%.*]] = bitcast <9 x i32>* [[IN:%.*]] to i32*
 ; CHECK-NEXT: [[VEC_START:%.*]] = mul i64 0, [[STRIDE:%.*]]
-; CHECK-NEXT: [[VEC_GEP:%.*]] = getelementptr i32, i32* [[TMP0]], i64 [[VEC_START]]
+; CHECK-NEXT: [[VEC_GEP:%.*]] = getelementptr i32, i32* %in, i64 [[VEC_START]]
 ; CHECK-NEXT: [[VEC_CAST:%.*]] = bitcast i32* [[VEC_GEP]] to <3 x i32>*
 ; CHECK-NEXT: [[COL_LOAD:%.*]] = load <3 x i32>, <3 x i32>* [[VEC_CAST]], align 4
 ; CHECK-NEXT: [[VEC_START1:%.*]] = mul i64 1, [[STRIDE]]
-; CHECK-NEXT: [[VEC_GEP2:%.*]] = getelementptr i32, i32* [[TMP0]], i64 [[VEC_START1]]
+; CHECK-NEXT: [[VEC_GEP2:%.*]] = getelementptr i32, i32* %in, i64 [[VEC_START1]]
 ; CHECK-NEXT: [[VEC_CAST3:%.*]] = bitcast i32* [[VEC_GEP2]] to <3 x i32>*
 ; CHECK-NEXT: [[COL_LOAD4:%.*]] = load <3 x i32>, <3 x i32>* [[VEC_CAST3]], align 4
 ; CHECK-NEXT: [[VEC_START5:%.*]] = mul i64 2, [[STRIDE]]
-; CHECK-NEXT: [[VEC_GEP6:%.*]] = getelementptr i32, i32* [[TMP0]], i64 [[VEC_START5]]
+; CHECK-NEXT: [[VEC_GEP6:%.*]] = getelementptr i32, i32* %in, i64 [[VEC_START5]]
 ; CHECK-NEXT: [[VEC_CAST7:%.*]] = bitcast i32* [[VEC_GEP6]] to <3 x i32>*
 ; CHECK-NEXT: [[COL_LOAD8:%.*]] = load <3 x i32>, <3 x i32>* [[VEC_CAST7]], align 4
 ; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <3 x i32> [[COL_LOAD]], <3 x i32> [[COL_LOAD4]], <6 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5>
@@ -24,45 +23,43 @@
 ; CHECK-NEXT: ret <9 x i32> [[TMP3]]
 ;
 entry:
-  %load = call <9 x i32> @llvm.matrix.column.major.load(<9 x i32>* %in, i64 %stride, i1 false, i32 3, i32 3)
+  %load = call <9 x i32> @llvm.matrix.column.major.load(i32* %in, i64 %stride, i1 false, i32 3, i32 3)
   ret <9 x i32> %load
 }
 
-declare <9 x i32> @llvm.matrix.column.major.load(<9 x i32>*, i64, i1, i32, i32)
+declare <9 x i32> @llvm.matrix.column.major.load(i32*, i64, i1, i32, i32)
 
-define <9 x i32> @strided_load_9x1(<9 x i32>* %in, i64 %stride) {
+define <9 x i32> @strided_load_9x1(i32* %in, i64 %stride) {
 ; CHECK-LABEL: @strided_load_9x1(
 ; CHECK-NEXT: entry:
-; CHECK-NEXT: [[TMP0:%.*]] = bitcast <9 x i32>* [[IN:%.*]] to i32*
 ; CHECK-NEXT: [[VEC_START:%.*]] = mul i64 0, [[STRIDE:%.*]]
-; CHECK-NEXT: [[VEC_GEP:%.*]] = getelementptr i32, i32* [[TMP0]], i64 [[VEC_START]]
+; CHECK-NEXT: [[VEC_GEP:%.*]] = getelementptr i32, i32* %in, i64 [[VEC_START]]
 ; CHECK-NEXT: [[VEC_CAST:%.*]] = bitcast i32* [[VEC_GEP]] to <9 x i32>*
 ; CHECK-NEXT: [[COL_LOAD:%.*]] = load <9 x i32>, <9 x i32>* [[VEC_CAST]], align 4
 ; CHECK-NEXT: ret <9 x i32> [[COL_LOAD]]
 ;
 entry:
-  %load = call <9 x i32> @llvm.matrix.column.major.load(<9 x i32>* %in, i64 %stride, i1 false, i32 9, i32 1)
+  %load = call <9 x i32> @llvm.matrix.column.major.load(i32* %in, i64 %stride, i1 false, i32 9, i32 1)
   ret <9 x i32> %load
 }
 
-declare <8 x i32> @llvm.matrix.column.major.load.v8i32(<8 x i32>*, i64, i1, i32, i32)
+declare <8 x i32> @llvm.matrix.column.major.load.v8i32(i32*, i64, i1, i32, i32)
 
-define <8 x i32> @strided_load_4x2(<8 x i32>* %in, i64 %stride) {
+define <8 x i32> @strided_load_4x2(i32* %in, i64 %stride) {
 ; CHECK-LABEL: @strided_load_4x2(
 ; CHECK-NEXT: entry:
-; CHECK-NEXT: [[TMP0:%.*]] = bitcast <8 x i32>* [[IN:%.*]] to i32*
 ; CHECK-NEXT: [[VEC_START:%.*]] = mul i64 0, [[STRIDE:%.*]]
-; CHECK-NEXT: [[VEC_GEP:%.*]] = getelementptr i32, i32* [[TMP0]], i64 [[VEC_START]]
+; CHECK-NEXT: [[VEC_GEP:%.*]] = getelementptr i32, i32* %in, i64 [[VEC_START]]
 ; CHECK-NEXT: [[VEC_CAST:%.*]] = bitcast i32* [[VEC_GEP]] to <4 x i32>*
 ; CHECK-NEXT: [[COL_LOAD:%.*]] = load <4 x i32>, <4 x i32>* [[VEC_CAST]], align 4
 ; CHECK-NEXT: [[VEC_START1:%.*]] = mul i64 1, [[STRIDE]]
-; CHECK-NEXT: [[VEC_GEP2:%.*]] = getelementptr i32, i32* [[TMP0]], i64 [[VEC_START1]]
+; CHECK-NEXT: [[VEC_GEP2:%.*]] = getelementptr i32, i32* %in, i64 [[VEC_START1]]
 ; CHECK-NEXT: [[VEC_CAST3:%.*]] = bitcast i32* [[VEC_GEP2]] to <4 x i32>*
 ; CHECK-NEXT: [[COL_LOAD4:%.*]] = load <4 x i32>, <4 x i32>* [[VEC_CAST3]], align 4
 ; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <4 x i32> [[COL_LOAD]], <4 x i32> [[COL_LOAD4]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
 ; CHECK-NEXT: ret <8 x i32> [[TMP1]]
 ;
 entry:
-  %load = call <8 x i32> @llvm.matrix.column.major.load.v8i32(<8 x i32>* %in, i64 %stride, i1 false, i32 4, i32 2)
+  %load = call <8 x i32> @llvm.matrix.column.major.load.v8i32(i32* %in, i64 %stride, i1 false, i32 4, i32 2)
   ret <8 x i32> %load
 }
diff --git a/llvm/test/Transforms/LowerMatrixIntrinsics/strided-store-double.ll b/llvm/test/Transforms/LowerMatrixIntrinsics/strided-store-double.ll
--- a/llvm/test/Transforms/LowerMatrixIntrinsics/strided-store-double.ll
+++ b/llvm/test/Transforms/LowerMatrixIntrinsics/strided-store-double.ll
@@ -13,7 +13,7 @@
 ; CHECK-NEXT: store <3 x double> [[SPLIT1]], <3 x double>* [[VEC_CAST2]], align 8
 ; CHECK-NEXT: ret void
 ;
-  call void @llvm.matrix.column.major.store(<6 x double> %in, double* %out, i64 5, i1 false, i32 3, i32 2)
+  call void @llvm.matrix.column.major.store.v6f64(<6 x double> %in, double* %out, i64 5, i1 false, i32 3, i32 2)
   ret void
 }
 
@@ -31,13 +31,10 @@
 ; CHECK-NEXT: store <3 x double> [[SPLIT1]], <3 x double>* [[VEC_CAST4]], align 8
 ; CHECK-NEXT: ret void
 ;
-  call void @llvm.matrix.column.major.store(<6 x double> %in, double* %out, i64 %stride, i1 false, i32 3, i32 2)
+  call void @llvm.matrix.column.major.store.v6f64(<6 x double> %in, double* %out, i64 %stride, i1 false, i32 3, i32 2)
   ret void
 }
 
-
-declare void @llvm.matrix.column.major.store(<6 x double>, double*, i64, i1, i32, i32)
-
 define void @strided_store_2x3(<10 x double> %in, double* %out) {
 ; CHECK-LABEL: @strided_store_2x3(
 ; CHECK-NEXT: [[SPLIT:%.*]] = shufflevector <10 x double> [[IN:%.*]], <10 x double> undef, <2 x i32> <i32 0, i32 1>
@@ -65,10 +62,9 @@
   ret void
 }
 
+declare void @llvm.matrix.column.major.store.v6f64(<6 x double>, double*, i64, i1, i32, i32)
 declare void @llvm.matrix.column.major.store.v10f64(<10 x double>, double*, i64, i1, i32, i32)
 
-; CHECK: declare void @llvm.matrix.column.major.store.v6f64.p0f64(<6 x double>, double* nocapture writeonly, i64, i1 immarg, i32 immarg, i32 immarg) [[WRITEONLY:#[0-9]]]
-
-; CHECK: declare void @llvm.matrix.column.major.store.v10f64.p0f64(<10 x double>, double* nocapture writeonly, i64, i1 immarg, i32 immarg, i32 immarg) [[WRITEONLY]]
-
-; CHECK: attributes [[WRITEONLY]] = { argmemonly nosync nounwind willreturn writeonly }
+; CHECK: declare void @llvm.matrix.column.major.store.v6f64(<6 x double>, double* nocapture writeonly, i64, i1 immarg, i32 immarg, i32 immarg) #0
+; CHECK: declare void @llvm.matrix.column.major.store.v10f64(<10 x double>, double* nocapture writeonly, i64, i1 immarg, i32 immarg, i32 immarg) #0
+; CHECK: attributes #0 = { argmemonly nosync nounwind willreturn writeonly }
diff --git a/llvm/test/Verifier/matrix-intrinsics.ll b/llvm/test/Verifier/matrix-intrinsics.ll
--- a/llvm/test/Verifier/matrix-intrinsics.ll
+++ b/llvm/test/Verifier/matrix-intrinsics.ll
@@ -1,11 +1,10 @@
 ; RUN: not llvm-as < %s -o /dev/null 2>&1 | FileCheck %s
 
-declare <4 x float> @llvm.matrix.transpose.v4f32(<4 x float>, i32, i32)
 define <4 x float> @transpose(<4 x float> %m, i32 %arg) {
 ; CHECK: assembly parsed, but does not verify as correct!
-; CHECK-NEXT: result of a matrix operation does not fit in the returned vector
-; CHECK-NEXT: result of a matrix operation does not fit in the returned vector
-; CHECK-NEXT: result of a matrix operation does not fit in the returned vector
+; CHECK-NEXT: Result of a matrix operation does not fit in the returned vector!
+; CHECK-NEXT: Result of a matrix operation does not fit in the returned vector!
+; CHECK-NEXT: Result of a matrix operation does not fit in the returned vector!
 ; CHECK-NEXT: immarg operand has non-immediate parameter
 ; CHECK-NEXT: i32 %arg
 ; CHECK-NEXT: %result.3 = call <4 x float> @llvm.matrix.transpose.v4f32(<4 x float> %result.2, i32 %arg, i32 2)
@@ -20,11 +19,10 @@
   ret <4 x float> %result.4
 }
 
-declare <4 x float> @llvm.matrix.multiply.v4f32.v4f32.v4f32(<4 x float>, <4 x float>, i32, i32, i32)
 define <4 x float> @multiply(<4 x float> %m, i32 %arg) {
-; CHECK-NEXT: result of a matrix operation does not fit in the returned vector
-; CHECK-NEXT: result of a matrix operation does not fit in the returned vector
-; CHECK-NEXT: result of a matrix operation does not fit in the returned vector
+; CHECK-NEXT: Result of a matrix operation does not fit in the returned vector!
+; CHECK-NEXT: Result of a matrix operation does not fit in the returned vector!
+; CHECK-NEXT: Result of a matrix operation does not fit in the returned vector!
; CHECK-NEXT: immarg operand has non-immediate parameter ; CHECK-NEXT: i32 %arg ; CHECK-NEXT: %result.3 = call <4 x float> @llvm.matrix.multiply.v4f32.v4f32.v4f32(<4 x float> %result.2, <4 x float> %m, i32 %arg, i32 2, i32 1) @@ -35,32 +33,130 @@ ret <4 x float> %result.3 } -declare <4 x float> @llvm.matrix.column.major.load.v4f32.p0v4f32(<4 x float>*, i64, i1, i32, i32) -declare <6 x float> @llvm.matrix.column.major.load.v6f32.p0v6f32(<6 x float>*, i64, i1, i32, i32) -define <4 x float> @column.major_load(<4 x float>* %m, <6 x float>* %n, i32 %arg) { -; CHECK-NEXT: result of a matrix operation does not fit in the returned vector -; CHECK-NEXT: result of a matrix operation does not fit in the returned vector -; CHECK-NEXT: result of a matrix operation does not fit in the returned vector +define <4 x float> @column.major_load(float* %m, float* %n, i32 %arg) { +; CHECK-NEXT: Result of a matrix operation does not fit in the returned vector! +; CHECK-NEXT: Result of a matrix operation does not fit in the returned vector! +; CHECK-NEXT: Result of a matrix operation does not fit in the returned vector! ; CHECK-NEXT: immarg operand has non-immediate parameter ; CHECK-NEXT: i32 %arg -; CHECK-NEXT: %result.3 = call <6 x float> @llvm.matrix.column.major.load.v6f32.p0v6f32(<6 x float>* %n, i64 2, i1 true, i32 3, i32 %arg) - %result.0 = call <4 x float> @llvm.matrix.column.major.load.v4f32.p0v4f32(<4 x float>* %m, i64 0, i1 false, i32 0, i32 0) - %result.1 = call <4 x float> @llvm.matrix.column.major.load.v4f32.p0v4f32(<4 x float>* %m, i64 2, i1 false, i32 1, i32 2) - %result.2 = call <6 x float> @llvm.matrix.column.major.load.v6f32.p0v6f32(<6 x float>* %n, i64 2, i1 true, i32 3, i32 3) - %result.3 = call <6 x float> @llvm.matrix.column.major.load.v6f32.p0v6f32(<6 x float>* %n, i64 2, i1 true, i32 3, i32 %arg) +; CHECK-NEXT: %result.3 = call <6 x float> @llvm.matrix.column.major.load.v6f32(float* %n, i64 2, i1 true, i32 3, i32 %arg) + %result.0 = call <4 x float> @llvm.matrix.column.major.load.v4f32(float* %m, i64 0, i1 false, i32 0, i32 0) + %result.1 = call <4 x float> @llvm.matrix.column.major.load.v4f32(float* %m, i64 2, i1 false, i32 1, i32 2) + %result.2 = call <6 x float> @llvm.matrix.column.major.load.v6f32(float* %n, i64 2, i1 true, i32 3, i32 3) + %result.3 = call <6 x float> @llvm.matrix.column.major.load.v6f32(float* %n, i64 2, i1 true, i32 3, i32 %arg) ret <4 x float> %result.1 } -declare void @llvm.matrix.column.major.store.v4f32.p0v4f32(<4 x float>, <4 x float>*, i64, i1, i32, i32) -declare void @llvm.matrix.column.major.store.v6f32.p0v6f32(<6 x float>, <6 x float>*, i64, i1, i32, i32) -define void @column.major_store(<4 x float>* %m, <6 x float>* %n, i64 %arg) { -; CHECK-NEXT: result of a matrix operation does not fit in the returned vector -; CHECK-NEXT: result of a matrix operation does not fit in the returned vector -; CHECK-NEXT: result of a matrix operation does not fit in the returned vector -; CHECK-NEXT: result of a matrix operation does not fit in the returned vector - call void @llvm.matrix.column.major.store.v4f32.p0v4f32(<4 x float> zeroinitializer, <4 x float>* %m, i64 0, i1 false, i32 0, i32 0) - call void @llvm.matrix.column.major.store.v4f32.p0v4f32(<4 x float> zeroinitializer, <4 x float>* %m, i64 2, i1 false, i32 1, i32 2) - call void @llvm.matrix.column.major.store.v6f32.p0v6f32(<6 x float> zeroinitializer, <6 x float>* %n, i64 2, i1 false, i32 3, i32 3) - call void @llvm.matrix.column.major.store.v6f32.p0v6f32(<6 x float> zeroinitializer, <6 x float>* %n, i64 %arg, 
i1 false, i32 3, i32 3) +define void @column.major_store(float* %m, float* %n, i64 %arg) { +; CHECK-NEXT: Result of a matrix operation does not fit in the returned vector! +; CHECK-NEXT: Result of a matrix operation does not fit in the returned vector! +; CHECK-NEXT: Result of a matrix operation does not fit in the returned vector! +; CHECK-NEXT: Result of a matrix operation does not fit in the returned vector! + call void @llvm.matrix.column.major.store.v4f32(<4 x float> zeroinitializer, float* %m, i64 0, i1 false, i32 0, i32 0) + call void @llvm.matrix.column.major.store.v4f32(<4 x float> zeroinitializer, float* %m, i64 2, i1 false, i32 1, i32 2) + call void @llvm.matrix.column.major.store.v6f32(<6 x float> zeroinitializer, float* %n, i64 2, i1 false, i32 3, i32 3) + call void @llvm.matrix.column.major.store.v6f32(<6 x float> zeroinitializer, float* %n, i64 %arg, i1 false, i32 3, i32 3) + ret void +} + +define <4 x float> @transpose_mixed_types(<4 x float> %fvec, <4 x i32> %ivec, i32 %arg) { +; +; CHECK-NEXT: Intrinsic has incorrect argument type! +; CHECK-NEXT: <4 x float> (<4 x i32>, i32, i32)* @llvm.matrix.transpose.v4f32.v4i32 +; CHECK-NEXT: Intrinsic has incorrect argument type! +; CHECK-NEXT: <4 x i32> (<4 x float>, i32, i32)* @llvm.matrix.transpose.v4i32.v4f32 +; + %result.0 = call <4 x float> @llvm.matrix.transpose.v4f32.v4i32(<4 x i32> %ivec, i32 0, i32 0) + %result.1 = call <4 x i32> @llvm.matrix.transpose.v4i32.v4f32(<4 x float> %result.0, i32 3, i32 2) + ret <4 x float> %result.0 +} + +define <4 x float> @multiply_mixed_types(<4 x i32> %ivec, <4 x float> %fvec, i32 %arg) { +; +; CHECK-NEXT: Vector element type mismatch of the result and first operand vector! +; CHECK-NEXT: <4 x i32> (<4 x float>, <4 x float>, i32, i32, i32)* @llvm.matrix.multiply.v4i32.v4f32.v4f32 +; CHECK-NEXT: Vector element type mismatch of the result and first operand vector! +; CHECK-NEXT: <4 x float> (<4 x i32>, <4 x float>, i32, i32, i32)* @llvm.matrix.multiply.v4f32.v4i32.v4f32 +; CHECK-NEXT: Vector element type mismatch of the result and second operand vector! +; CHECK-NEXT: <4 x float> (<4 x float>, <4 x i32>, i32, i32, i32)* @llvm.matrix.multiply.v4f32.v4f32.v4i32 +; CHECK-NEXT: Vector element type mismatch of the result and first operand vector! +; CHECK-NEXT: <4 x float> (<4 x i32>, <4 x i32>, i32, i32, i32)* @llvm.matrix.multiply.v4f32.v4i32.v4i32 +; + %result.0 = call <4 x i32> @llvm.matrix.multiply.v4i32.v4f32.v4f32(<4 x float> %fvec, <4 x float> %fvec, i32 2, i32 2, i32 2) + %result.1 = call <4 x float> @llvm.matrix.multiply.v4f32.v4i32.v4f32(<4 x i32> %result.0, <4 x float> %fvec, i32 2, i32 2, i32 2) + %result.2 = call <4 x float> @llvm.matrix.multiply.v4f32.v4f32.v4i32(<4 x float> %fvec, <4 x i32> %ivec, i32 2, i32 2, i32 2) + %result.3 = call <4 x float> @llvm.matrix.multiply.v4f32.v4i32.v4i32(<4 x i32> %ivec, <4 x i32> %ivec, i32 2, i32 2, i32 2) + ret <4 x float> %result.3 +} + +define <4 x float> @column.major_load_mixed_types(i32* %m, float* %n, i32 %arg) { +; +; CHECK-NEXT: Intrinsic has incorrect argument type! +; CHECK-NEXT: <4 x float> (i32*, i64, i1, i32, i32)* @llvm.matrix.column.major.load.v4f32.pi32 +; CHECK-NEXT: Intrinsic has incorrect argument type! 
+; CHECK-NEXT: <4 x i32> (float*, i64, i1, i32, i32)* @llvm.matrix.column.major.load.v4i32 +; + %result.0 = call <4 x float> @llvm.matrix.column.major.load.v4f32.pi32(i32* %m, i64 2, i1 false, i32 2, i32 2) + %result.1 = call <4 x i32> @llvm.matrix.column.major.load.v4i32(float* %n, i64 2, i1 false, i32 2, i32 2) + ret <4 x float> %result.0 +} + +define void @column.major_store_mixed_types(float* %m, i32* %n, i64 %arg) { +; +; CHECK-NEXT: Intrinsic has incorrect argument type! +; CHECK-NEXT: void (<4 x i32>, float*, i64, i1, i32, i32)* @llvm.matrix.column.major.store.v4i32.vi32 +; CHECK-NEXT: Intrinsic has incorrect argument type! +; CHECK-NEXT: void (<4 x float>, i32*, i64, i1, i32, i32)* @llvm.matrix.column.major.store.v4f32.pi32 +; + call void @llvm.matrix.column.major.store.v4i32.vi32(<4 x i32> zeroinitializer, float* %m, i64 2, i1 false, i32 2, i32 2) + call void @llvm.matrix.column.major.store.v4f32.pi32(<4 x float> zeroinitializer, i32* %n, i64 2, i1 false, i32 2, i32 2) ret void } + +define void @column.major_store_non_int_float_type(<4 x float>* %m, <4 x float>* %n, i64 %arg) { +; +; CHECK-NEXT: Intrinsic has incorrect argument type! +; CHECK-NEXT: void (<4 x float*>, <4 x float>*, i64, i1, i32, i32)* @llvm.matrix.column.major.store.v4f32p0.p0v4f32 +; + call void @llvm.matrix.column.major.store.v4f32p0.p0v4f32(<4 x float*> zeroinitializer, <4 x float>* %n, i64 2, i1 false, i32 2, i32 2) + ret void +} + +define <4 x float> @column.major_load_stride_too_small(float* %m, i32 %arg) { +; +; CHECK-NEXT: Stride must be greater or equal than the number of rows! +; CHECK-NEXT: <4 x float> (float*, i64, i1, i32, i32)* @llvm.matrix.column.major.load.v4f32 +; + %result.1 = call <4 x float> @llvm.matrix.column.major.load.v4f32(float* %m, i64 1, i1 false, i32 2, i32 2) + ret <4 x float> %result.1 +} + +define void @column.major_store_stride_too_small(float* %m, i64 %arg) { +; +; CHECK-NEXT: Stride must be greater or equal than the number of rows! 
+; CHECK-NEXT: void (<4 x float>, float*, i64, i1, i32, i32)* @llvm.matrix.column.major.store.v4f32 +; + call void @llvm.matrix.column.major.store.v4f32(<4 x float> zeroinitializer, float* %m, i64 1, i1 false, i32 2, i32 2) + ret void +} + +declare <4 x i32> @llvm.matrix.column.major.load.v4i32(float*, i64, i1, i32, i32) +declare <4 x float> @llvm.matrix.column.major.load.v4f32.pi32(i32*, i64, i1, i32, i32) +declare <4 x float> @llvm.matrix.column.major.load.v4f32(float*, i64, i1, i32, i32) +declare <6 x float> @llvm.matrix.column.major.load.v6f32(float*, i64, i1, i32, i32) + +declare void @llvm.matrix.column.major.store.v4f32(<4 x float>, float*, i64, i1, i32, i32) +declare void @llvm.matrix.column.major.store.v6f32(<6 x float>, float*, i64, i1, i32, i32) +declare void @llvm.matrix.column.major.store.v4i32.vi32(<4 x i32>, float*, i64, i1, i32, i32) +declare void @llvm.matrix.column.major.store.v4f32.pi32(<4 x float>, i32*, i64, i1, i32, i32) +declare void @llvm.matrix.column.major.store.v4f32p0.p0v4f32(<4 x float*>, <4 x float>*, i64, i1, i32, i32) + +declare <4 x i32> @llvm.matrix.transpose.v4i32.v4f32(<4 x float>, i32, i32) +declare <4 x float> @llvm.matrix.transpose.v4f32(<4 x float>, i32, i32) +declare <4 x float> @llvm.matrix.transpose.v4f32.v4i32(<4 x i32>, i32, i32) + +declare <4 x i32> @llvm.matrix.multiply.v4i32.v4f32.v4f32(<4 x float>, <4 x float>, i32, i32, i32) +declare <4 x float> @llvm.matrix.multiply.v4f32.v4i32.v4f32(<4 x i32>, <4 x float>, i32, i32, i32) +declare <4 x float> @llvm.matrix.multiply.v4f32.v4f32.v4i32(<4 x float>, <4 x i32>, i32, i32, i32) +declare <4 x float> @llvm.matrix.multiply.v4f32.v4i32.v4i32(<4 x i32>, <4 x i32>, i32, i32, i32) +declare <4 x float> @llvm.matrix.multiply.v4f32.v4f32.v4f32(<4 x float>, <4 x float>, i32, i32, i32)
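Editor's note: taken together, the test updates above boil down to one interface
change: the pointer operand of the column-major load/store intrinsics is now typed
as a pointer to the element type rather than to the vector type. A hypothetical
before/after sketch (function names illustrative):

::

      ; Before: the pointer operand was a vector pointer, which the lowering
      ; immediately bitcast to an element pointer:
      ;   %l = call <9 x double> @llvm.matrix.column.major.load.v9f64.p0v9f64(<9 x double>* %p, i64 3, i1 false, i32 3, i32 3)

      ; After: the operand is an element pointer, matching how columns are addressed.
      declare <9 x double> @llvm.matrix.column.major.load.v9f64(double*, i64, i1, i32, i32)

      define <9 x double> @load_3x3(double* %p) {
        %l = call <9 x double> @llvm.matrix.column.major.load.v9f64(double* %p, i64 3, i1 false, i32 3, i32 3)
        ret <9 x double> %l
      }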