diff --git a/llvm/docs/LangRef.rst b/llvm/docs/LangRef.rst
--- a/llvm/docs/LangRef.rst
+++ b/llvm/docs/LangRef.rst
@@ -15525,6 +15525,7 @@
 Syntax:
 """""""
+This is an overloaded intrinsic.
 
 ::
 
@@ -15549,17 +15550,20 @@
 -----------------
 
 Operations on matrixes requiring shape information (like number of rows/columns
-or the memory layout) can be expressed using the matrix intrinsics. Matrixes are
-embedded in a flat vector and the intrinsics take the dimensions as arguments.
-Currently column-major layout is assumed. The intrinsics support both integer
-and floating point matrixes.
+or the memory layout) can be expressed using the matrix intrinsics. These
+intrinsics require matrix dimensions to be passed as immediate arguments, and
+matrixes are passed and returned as vectors. This means that for an ``R`` x
+``C`` matrix, element ``i`` of column ``j`` is at index ``j * R + i`` in the
+corresponding vector, with indices starting at 0. Currently column-major layout
+is assumed. The intrinsics support both integer and floating point matrixes.
 
 '``llvm.matrix.transpose.*``' Intrinsic
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 Syntax:
 """""""
+This is an overloaded intrinsic.
 
 ::
 
@@ -15568,21 +15572,24 @@
 Overview:
 """""""""
 
-The '``llvm.matrix.transpose.*``' intrinsic treats %In as containing a matrix
-with <Rows> rows and <Cols> columns and returns the transposed matrix embedded in
-the result vector.
+The '``llvm.matrix.transpose.*``' intrinsics treat %In as a <Rows> x <Cols>
+matrix and return the transposed matrix in the result vector.
 
 Arguments:
 """"""""""
 
-The <Rows> and <Cols> arguments must be constant integers. The vector argument
-%In and the returned vector must have <Rows> * <Cols> elements.
+The first argument %In is a vector that corresponds to a <Rows> x <Cols> matrix.
+Thus, arguments <Rows> and <Cols> correspond to the number of rows and columns,
+respectively, and must be positive, constant integers. The returned vector must
+have <Rows> * <Cols> elements, and have the same float or integer element type
+as %In.
 
 '``llvm.matrix.multiply.*``' Intrinsic
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 Syntax:
 """""""
+This is an overloaded intrinsic.
 
 ::
 
@@ -15591,18 +15598,19 @@
 Overview:
 """""""""
 
-The '``llvm.matrix.multiply.*``' intrinsic treats %A as a matrix with <M>
-rows and <K> columns, %B as a matrix with <K> rows and <N>
-columns and multiplies them. The result matrix is returned embedded in the
-result vector.
+The '``llvm.matrix.multiply.*``' intrinsics treat %A as an <OuterRows> x <Inner>
+matrix, %B as an <Inner> x <OuterColumns> matrix, and multiply them. The result
+matrix is returned in the result vector.
 
 Arguments:
 """"""""""
 
-The <M>, <K> and <N> arguments must be constant
-integers. The vector argument %A must have <M> * <K> elements, %B
-must have <K> * <N> elements and the returned vector must have
-<M> * <N> elements.
+The first vector argument %A corresponds to a matrix with <OuterRows> * <Inner>
+elements, and the second argument %B to a matrix with <Inner> * <OuterColumns>
+elements. Arguments <OuterRows>, <Inner> and <OuterColumns> must be positive,
+constant integers. The returned vector must have <OuterRows> * <OuterColumns>
+elements. Vectors %A, %B, and the returned vector all have the same float or
+integer element type.
 
 
 '``llvm.matrix.column.major.load.*``' Intrinsic
@@ -15610,6 +15618,7 @@
 
 Syntax:
 """""""
+This is an overloaded intrinsic.
 
 ::
 
@@ -15619,22 +15628,26 @@
 Overview:
 """""""""
 
-The '``llvm.matrix.column.major.load.*``' intrinsic loads a matrix with <Rows>
-rows and <Cols> columns, using a stride of %Stride between columns. For two
-consecutive columns A and B, %Stride refers to the distance (the number of
-elements) between the start of column A and the start of column B. The result
-matrix is returned embedded in the result vector. This allows for convenient
-loading of sub matrixes. If <IsVolatile> is true, the intrinsic is considered
-a :ref:`volatile memory access <volatile>`.
-
-If the %Ptr argument is known to be aligned to some boundary, this can be
-specified as an attribute on the argument.
+The '``llvm.matrix.column.major.load.*``' intrinsics load a <Rows> x <Cols>
+matrix using a stride of %Stride to compute the start address of the different
+columns. This allows for convenient loading of sub matrixes. If <IsVolatile>
+is true, the intrinsic is considered a :ref:`volatile memory access
+<volatile>`. The result matrix is returned in the result vector. If the %Ptr
+argument is known to be aligned to some boundary, this can be specified as an
+attribute on the argument.
 
 Arguments:
 """"""""""
 
-The <Rows>, <Cols> and <IsVolatile> arguments must be constant integers. The
-returned vector must have <Rows> * <Cols> elements. %Stride must be >= <Rows>.
+The first argument %Ptr is a pointer to the element type of the returned
+vector, and corresponds to the start address to load from. The second argument
+%Stride is a positive, constant integer with %Stride ``>=`` <Rows>. %Stride is
+used to compute the column memory addresses. I.e., for a column ``C``, its
+start memory address is calculated with %Ptr + ``C`` * %Stride. The third
+argument <IsVolatile> is a boolean value. The fourth and fifth arguments,
+<Rows> and <Cols>, correspond to the number of rows and columns, respectively,
+and must be positive, constant integers. The returned vector must have
+<Rows> * <Cols> elements.
 
 The :ref:`align <attr_align>` parameter attribute can be provided for the %Ptr
 arguments.
 
@@ -15654,12 +15667,10 @@
 
 Overview:
 """""""""
 
-The '``llvm.matrix.column.major.store.*``' intrinsic stores the matrix with
-<Rows> rows and <Cols> columns embedded in %In, using a stride of %Stride
-between columns. For two consecutive columns A and B, %Stride refers to the
-distance (the number of elements) between the start of column A and the start
-of column B. If <IsVolatile> is true, the intrinsic is considered a
-:ref:`volatile memory access <volatile>`.
+The '``llvm.matrix.column.major.store.*``' intrinsics store the <Rows> x <Cols>
+matrix in %In to memory using a stride of %Stride between columns. If
+<IsVolatile> is true, the intrinsic is considered a :ref:`volatile memory
+access <volatile>`.
 
 If the %Ptr argument is known to be aligned to some boundary, this can be
 specified as an attribute on the argument.
 
@@ -15667,8 +15678,15 @@
 
 Arguments:
 """"""""""
 
-The <Rows>, <Cols>, <IsVolatile> arguments must be constant integers. The
-vector argument %In must have <Rows> * <Cols> elements. %Stride must be >= <Rows>.
+The first argument %In is a vector that corresponds to a <Rows> x <Cols> matrix
+to be stored to memory. The second argument %Ptr is a pointer to the element
+type of %In, and is the start address of the matrix in memory. The third
+argument %Stride is a positive, constant integer with %Stride ``>=`` <Rows>.
+%Stride is used to compute the column memory addresses. I.e., for a column
+``C``, its start memory address is calculated with %Ptr + ``C`` * %Stride.
+The fourth argument <IsVolatile> is a boolean value. The arguments <Rows> and
+<Cols> correspond to the number of rows and columns, respectively, and must be
+positive, constant integers.
 
 The :ref:`align <attr_align>` parameter attribute can be provided for the %Ptr
 arguments.
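Editor's note: to make the updated LangRef rules concrete, here is a small hypothetical
IR example (function and value names are illustrative and not part of the patch). A
3 x 2 double matrix travels as a flat <6 x double>, with element ``i`` of column ``j``
at index ``j * 3 + i``; the load and store intrinsics take a pointer to the element
type plus a stride that must be >= the number of rows:

::

      declare <6 x double> @llvm.matrix.column.major.load.v6f64(double*, i64, i1, i32, i32)
      declare <6 x double> @llvm.matrix.transpose.v6f64(<6 x double>, i32, i32)
      declare void @llvm.matrix.column.major.store.v6f64(<6 x double>, double*, i64, i1, i32, i32)

      define void @transpose_3x2(double* %in, double* %out) {
        ; Load a densely packed 3 x 2 matrix: stride 3 == number of rows.
        %m = call <6 x double> @llvm.matrix.column.major.load.v6f64(double* %in, i64 3, i1 false, i32 3, i32 2)
        ; Transpose it to a 2 x 3 matrix held in the same flat vector form.
        %t = call <6 x double> @llvm.matrix.transpose.v6f64(<6 x double> %m, i32 3, i32 2)
        ; Store the 2 x 3 result: the stride must now be >= 2.
        call void @llvm.matrix.column.major.store.v6f64(<6 x double> %t, double* %out, i64 2, i1 false, i32 2, i32 3)
        ret void
      }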
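Similarly, a hedged sketch of the multiply shape rules (again with illustrative names):
multiplying an <OuterRows> x <Inner> = 2 x 3 matrix by an <Inner> x <OuterColumns> =
3 x 2 matrix yields a 2 x 2 result, so %A and %B each carry 6 elements, the result
carries 4, and all three share the same element type, as the new verifier checks require:

::

      declare <4 x double> @llvm.matrix.multiply.v4f64.v6f64.v6f64(<6 x double>, <6 x double>, i32, i32, i32)

      define <4 x double> @multiply_2x3_3x2(<6 x double> %A, <6 x double> %B) {
        ; <OuterRows> = 2, <Inner> = 3, <OuterColumns> = 2;
        ; 2 * 3, 3 * 2 and 2 * 2 elements respectively.
        %C = call <4 x double> @llvm.matrix.multiply.v4f64.v6f64.v6f64(<6 x double> %A, <6 x double> %B, i32 2, i32 3, i32 2)
        ret <4 x double> %C
      }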
diff --git a/llvm/include/llvm/IR/Intrinsics.td b/llvm/include/llvm/IR/Intrinsics.td
--- a/llvm/include/llvm/IR/Intrinsics.td
+++ b/llvm/include/llvm/IR/Intrinsics.td
@@ -1458,7 +1458,7 @@
 def int_matrix_column_major_load
   : Intrinsic<[llvm_anyvector_ty],
-              [LLVMAnyPointerType<LLVMMatchType<0>>, llvm_i64_ty, llvm_i1_ty,
+              [LLVMPointerToElt<0>, llvm_i64_ty, llvm_i1_ty,
                llvm_i32_ty, llvm_i32_ty],
              [IntrNoSync, IntrWillReturn, IntrArgMemOnly, IntrReadMem,
               NoCapture<ArgIndex<0>>, ImmArg<ArgIndex<2>>, ImmArg<ArgIndex<3>>,
@@ -1466,7 +1466,7 @@
 def int_matrix_column_major_store
   : Intrinsic<[],
-              [llvm_anyvector_ty, LLVMAnyPointerType<LLVMMatchType<0>>,
+              [llvm_anyvector_ty, LLVMPointerToElt<0>,
                llvm_i64_ty, llvm_i1_ty, llvm_i32_ty, llvm_i32_ty],
              [IntrNoSync, IntrWillReturn, IntrArgMemOnly, IntrWriteMem,
               WriteOnly<ArgIndex<1>>, NoCapture<ArgIndex<1>>,
diff --git a/llvm/lib/IR/Verifier.cpp b/llvm/lib/IR/Verifier.cpp
--- a/llvm/lib/IR/Verifier.cpp
+++ b/llvm/lib/IR/Verifier.cpp
@@ -5017,36 +5017,73 @@
   case Intrinsic::matrix_transpose:
   case Intrinsic::matrix_column_major_load:
   case Intrinsic::matrix_column_major_store: {
+    Function *IF = Call.getCalledFunction();
+    ConstantInt *Stride = nullptr;
     ConstantInt *NumRows;
     ConstantInt *NumColumns;
-    VectorType *TypeToCheck;
+    VectorType *ResultTy;
+    Type *Op0ElemTy = nullptr;
+    Type *Op1ElemTy = nullptr;
     switch (ID) {
     case Intrinsic::matrix_multiply:
       NumRows = cast<ConstantInt>(Call.getArgOperand(2));
       NumColumns = cast<ConstantInt>(Call.getArgOperand(4));
-      TypeToCheck = cast<VectorType>(Call.getType());
+      ResultTy = cast<VectorType>(Call.getType());
+      Op0ElemTy =
+          cast<VectorType>(Call.getArgOperand(0)->getType())->getElementType();
+      Op1ElemTy =
+          cast<VectorType>(Call.getArgOperand(1)->getType())->getElementType();
       break;
     case Intrinsic::matrix_transpose:
       NumRows = cast<ConstantInt>(Call.getArgOperand(1));
       NumColumns = cast<ConstantInt>(Call.getArgOperand(2));
-      TypeToCheck = cast<VectorType>(Call.getType());
+      ResultTy = cast<VectorType>(Call.getType());
+      Op0ElemTy =
+          cast<VectorType>(Call.getArgOperand(0)->getType())->getElementType();
       break;
     case Intrinsic::matrix_column_major_load:
+      Stride = dyn_cast<ConstantInt>(Call.getArgOperand(1));
       NumRows = cast<ConstantInt>(Call.getArgOperand(3));
      NumColumns = cast<ConstantInt>(Call.getArgOperand(4));
-      TypeToCheck = cast<VectorType>(Call.getType());
+      ResultTy = cast<VectorType>(Call.getType());
+      Op0ElemTy =
+          cast<PointerType>(Call.getArgOperand(0)->getType())->getElementType();
       break;
     case Intrinsic::matrix_column_major_store:
+      Stride = dyn_cast<ConstantInt>(Call.getArgOperand(2));
       NumRows = cast<ConstantInt>(Call.getArgOperand(4));
       NumColumns = cast<ConstantInt>(Call.getArgOperand(5));
-      TypeToCheck = cast<VectorType>(Call.getArgOperand(0)->getType());
+      ResultTy = cast<VectorType>(Call.getArgOperand(0)->getType());
+      Op0ElemTy =
+          cast<VectorType>(Call.getArgOperand(0)->getType())->getElementType();
+      Op1ElemTy =
+          cast<PointerType>(Call.getArgOperand(1)->getType())->getElementType();
       break;
     default:
       llvm_unreachable("unexpected intrinsic");
     }
-    Assert(TypeToCheck->getNumElements() ==
+
+    Assert(ResultTy->getElementType()->isIntegerTy() ||
+               ResultTy->getElementType()->isFloatingPointTy(),
+           "Result type must be an integer or floating-point type!", IF);
+
+    Assert(ResultTy->getElementType() == Op0ElemTy,
+           "Vector element type mismatch of the result and first operand "
+           "vector!", IF);
+
+    if (Op1ElemTy)
+      Assert(ResultTy->getElementType() == Op1ElemTy,
+             "Vector element type mismatch of the result and second operand "
+             "vector!", IF);
+
+    Assert(ResultTy->getNumElements() ==
               NumRows->getZExtValue() * NumColumns->getZExtValue(),
-           "result of a matrix operation does not fit in the returned vector");
+           "Result of a matrix operation does not fit in the returned vector!");
+
+    if (Stride)
+      Assert(Stride->getZExtValue() >= NumRows->getZExtValue(),
+             "Stride must be greater or equal than the number of rows!", IF);
+
     break;
   }
   };
diff --git a/llvm/test/Transforms/LowerMatrixIntrinsics/load-align-volatile.ll b/llvm/test/Transforms/LowerMatrixIntrinsics/load-align-volatile.ll
--- a/llvm/test/Transforms/LowerMatrixIntrinsics/load-align-volatile.ll
+++ b/llvm/test/Transforms/LowerMatrixIntrinsics/load-align-volatile.ll
@@ -1,30 +1,29 @@
 ; RUN: opt -lower-matrix-intrinsics -S < %s | FileCheck %s
 ; RUN: opt -passes='lower-matrix-intrinsics' -S < %s | FileCheck %s
 
-define <9 x double> @strided_load_3x3_volatile(<9 x double>* %in, i64 %stride) {
+define <9 x double> @strided_load_3x3_volatile(double* %in, i64 %stride) {
 ; CHECK-LABEL: @strided_load_3x3_volatile(
 ; CHECK-NEXT: entry:
-; CHECK-NEXT: [[TMP0:%.*]] = bitcast <9 x double>* [[IN:%.*]] to double*
 ; CHECK-NEXT: [[VEC_START:%.*]] = mul i64 0, [[STRIDE:%.*]]
-; CHECK-NEXT: [[VEC_GEP:%.*]] = getelementptr double, double* [[TMP0]], i64 [[VEC_START]]
+; CHECK-NEXT: [[VEC_GEP:%.*]] = getelementptr double, double* %in, i64 [[VEC_START]]
 ; CHECK-NEXT: [[VEC_CAST:%.*]] = bitcast double* [[VEC_GEP]] to <3 x double>*
 ; CHECK-NEXT: load volatile <3 x double>, <3 x double>* [[VEC_CAST]], align 8
 ; CHECK-NEXT: [[VEC_START1:%.*]] = mul i64 1, [[STRIDE]]
-; CHECK-NEXT: [[VEC_GEP2:%.*]] = getelementptr double, double* [[TMP0]], i64 [[VEC_START1]]
+; CHECK-NEXT: [[VEC_GEP2:%.*]] = getelementptr double, double* %in, i64 [[VEC_START1]]
 ; CHECK-NEXT: [[VEC_CAST3:%.*]] = bitcast double* [[VEC_GEP2]] to <3 x double>*
 ; CHECK-NEXT: load volatile <3 x double>, <3 x double>* [[VEC_CAST3]], align 8
 ; CHECK-NEXT: [[VEC_START5:%.*]] = mul i64 2, [[STRIDE]]
-; CHECK-NEXT: [[VEC_GEP6:%.*]] = getelementptr double, double* [[TMP0]], i64 [[VEC_START5]]
+; CHECK-NEXT: [[VEC_GEP6:%.*]] = getelementptr double, double* %in, i64 [[VEC_START5]]
 ; CHECK-NEXT: [[VEC_CAST7:%.*]] = bitcast double* [[VEC_GEP6]] to <3 x double>*
 ; CHECK-NEXT: load volatile <3 x double>, <3 x double>* [[VEC_CAST7]], align 8
 ; CHECK-NOT: = load
 ;
 entry:
-  %load = call <9 x double> @llvm.matrix.column.major.load.v9f64(<9 x double>* %in, i64 %stride, i1 true, i32 3, i32 3)
+  %load = call <9 x double> @llvm.matrix.column.major.load.v9f64(double* %in, i64 %stride, i1 true, i32 3, i32 3)
   ret <9 x double> %load
 }
 
-declare <9 x double> @llvm.matrix.column.major.load.v9f64(<9 x double>*, i64, i1, i32, i32)
+declare <9 x double> @llvm.matrix.column.major.load.v9f64(double*, i64, i1, i32, i32)
 
 define <4 x double> @load_volatile_multiply(<4 x double>* %in) {
 ; CHECK-LABEL: @load_volatile_multiply(
@@ -44,49 +43,47 @@
 declare <4 x double> @llvm.matrix.multiply(<4 x double>, <4 x double>, i32, i32, i32)
 
-define <9 x double> @strided_load_3x3_align32(<9 x double>* %in, i64 %stride) {
+define <9 x double> @strided_load_3x3_align32(double* %in, i64 %stride) {
 ; CHECK-LABEL: @strided_load_3x3_align32(
 ; CHECK-NEXT: entry:
-; CHECK-NEXT: [[TMP0:%.*]] = bitcast <9 x double>* [[IN:%.*]] to double*
 ; CHECK-NEXT: [[VEC_START:%.*]] = mul i64 0, [[STRIDE:%.*]]
-; CHECK-NEXT: [[VEC_GEP:%.*]] = getelementptr double, double* [[TMP0]], i64 [[VEC_START]]
+; CHECK-NEXT: [[VEC_GEP:%.*]] = getelementptr double, double* %in, i64 [[VEC_START]]
 ; CHECK-NEXT: [[VEC_CAST:%.*]] = bitcast double* [[VEC_GEP]] to <3 x double>*
 ; CHECK-NEXT: load <3 x double>, <3 x double>* [[VEC_CAST]], align 32
 ; CHECK-NEXT: [[VEC_START1:%.*]] = mul i64 1, [[STRIDE]]
-; CHECK-NEXT: [[VEC_GEP2:%.*]] = getelementptr double, double* [[TMP0]], i64 [[VEC_START1]]
+; CHECK-NEXT: [[VEC_GEP2:%.*]] = getelementptr double, double* %in, i64 [[VEC_START1]]
; CHECK-NEXT: [[VEC_CAST3:%.*]] = bitcast double* [[VEC_GEP2]] to <3 x double>* ; CHECK-NEXT: load <3 x double>, <3 x double>* [[VEC_CAST3]], align 8 ; CHECK-NEXT: [[VEC_START5:%.*]] = mul i64 2, [[STRIDE]] -; CHECK-NEXT: [[VEC_GEP6:%.*]] = getelementptr double, double* [[TMP0]], i64 [[VEC_START5]] +; CHECK-NEXT: [[VEC_GEP6:%.*]] = getelementptr double, double* %in, i64 [[VEC_START5]] ; CHECK-NEXT: [[VEC_CAST7:%.*]] = bitcast double* [[VEC_GEP6]] to <3 x double>* ; CHECK-NEXT: load <3 x double>, <3 x double>* [[VEC_CAST7]], align 8 ; CHECK-NOT: = load ; entry: - %load = call <9 x double> @llvm.matrix.column.major.load.v9f64(<9 x double>* align 32 %in, i64 %stride, i1 false, i32 3, i32 3) + %load = call <9 x double> @llvm.matrix.column.major.load.v9f64(double* align 32 %in, i64 %stride, i1 false, i32 3, i32 3) ret <9 x double> %load } -define <9 x double> @strided_load_3x3_align2(<9 x double>* %in, i64 %stride) { +define <9 x double> @strided_load_3x3_align2(double* %in, i64 %stride) { ; CHECK-LABEL: @strided_load_3x3_align2( ; CHECK-NEXT: entry: -; CHECK-NEXT: [[TMP0:%.*]] = bitcast <9 x double>* [[IN:%.*]] to double* ; CHECK-NEXT: [[VEC_START:%.*]] = mul i64 0, [[STRIDE:%.*]] -; CHECK-NEXT: [[VEC_GEP:%.*]] = getelementptr double, double* [[TMP0]], i64 [[VEC_START]] +; CHECK-NEXT: [[VEC_GEP:%.*]] = getelementptr double, double* %in, i64 [[VEC_START]] ; CHECK-NEXT: [[VEC_CAST:%.*]] = bitcast double* [[VEC_GEP]] to <3 x double>* ; CHECK-NEXT: load <3 x double>, <3 x double>* [[VEC_CAST]], align 2 ; CHECK-NEXT: [[VEC_START1:%.*]] = mul i64 1, [[STRIDE]] -; CHECK-NEXT: [[VEC_GEP2:%.*]] = getelementptr double, double* [[TMP0]], i64 [[VEC_START1]] +; CHECK-NEXT: [[VEC_GEP2:%.*]] = getelementptr double, double* %in, i64 [[VEC_START1]] ; CHECK-NEXT: [[VEC_CAST3:%.*]] = bitcast double* [[VEC_GEP2]] to <3 x double>* ; CHECK-NEXT: load <3 x double>, <3 x double>* [[VEC_CAST3]], align 2 ; CHECK-NEXT: [[VEC_START5:%.*]] = mul i64 2, [[STRIDE]] -; CHECK-NEXT: [[VEC_GEP6:%.*]] = getelementptr double, double* [[TMP0]], i64 [[VEC_START5]] +; CHECK-NEXT: [[VEC_GEP6:%.*]] = getelementptr double, double* %in, i64 [[VEC_START5]] ; CHECK-NEXT: [[VEC_CAST7:%.*]] = bitcast double* [[VEC_GEP6]] to <3 x double>* ; CHECK-NEXT: load <3 x double>, <3 x double>* [[VEC_CAST7]], align 2 ; CHECK-NOT: = load ; entry: - %load = call <9 x double> @llvm.matrix.column.major.load.v9f64(<9 x double>* align 2 %in, i64 %stride, i1 false, i32 3, i32 3) + %load = call <9 x double> @llvm.matrix.column.major.load.v9f64(double* align 2 %in, i64 %stride, i1 false, i32 3, i32 3) ret <9 x double> %load } @@ -106,16 +103,15 @@ ret <4 x double> %res } -define <6 x float> @strided_load_2x3_align16_stride2(<6 x float>* %in) { +define <6 x float> @strided_load_2x3_align16_stride2(float* %in) { ; CHECK-LABEL: @strided_load_2x3_align16_stride2( ; CHECK-NEXT: entry: -; CHECK-NEXT: [[TMP0:%.*]] = bitcast <6 x float>* [[IN:%.*]] to float* -; CHECK-NEXT: [[VEC_CAST:%.*]] = bitcast float* [[TMP0]] to <2 x float>* +; CHECK-NEXT: [[VEC_CAST:%.*]] = bitcast float* %in to <2 x float>* ; CHECK-NEXT: [[COL_LOAD:%.*]] = load <2 x float>, <2 x float>* [[VEC_CAST]], align 16 -; CHECK-NEXT: [[VEC_GEP:%.*]] = getelementptr float, float* [[TMP0]], i64 2 +; CHECK-NEXT: [[VEC_GEP:%.*]] = getelementptr float, float* %in, i64 2 ; CHECK-NEXT: [[VEC_CAST1:%.*]] = bitcast float* [[VEC_GEP]] to <2 x float>* ; CHECK-NEXT: [[COL_LOAD2:%.*]] = load <2 x float>, <2 x float>* [[VEC_CAST1]], align 8 -; CHECK-NEXT: [[VEC_GEP3:%.*]] = getelementptr float, float* [[TMP0]], i64 4 
+; CHECK-NEXT: [[VEC_GEP3:%.*]] = getelementptr float, float* %in, i64 4
 ; CHECK-NEXT: [[VEC_CAST4:%.*]] = bitcast float* [[VEC_GEP3]] to <2 x float>*
 ; CHECK-NEXT: [[COL_LOAD5:%.*]] = load <2 x float>, <2 x float>* [[VEC_CAST4]], align 16
 ; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <2 x float> [[COL_LOAD]], <2 x float> [[COL_LOAD2]], <4 x i32> <i32 0, i32 1, i32 2, i32 3>
@@ -124,8 +120,8 @@
 ; CHECK-NEXT: ret <6 x float> [[TMP3]]
 ;
 entry:
-  %load = call <6 x float> @llvm.matrix.column.major.load.v6f32(<6 x float>* align 16 %in, i64 2, i1 false, i32 2, i32 3)
+  %load = call <6 x float> @llvm.matrix.column.major.load.v6f32(float* align 16 %in, i64 2, i1 false, i32 2, i32 3)
   ret <6 x float> %load
 }
 
-declare <6 x float> @llvm.matrix.column.major.load.v6f32(<6 x float>*, i64, i1, i32, i32)
+declare <6 x float> @llvm.matrix.column.major.load.v6f32(float*, i64, i1, i32, i32)
diff --git a/llvm/test/Transforms/LowerMatrixIntrinsics/remarks-inlining.ll b/llvm/test/Transforms/LowerMatrixIntrinsics/remarks-inlining.ll
--- a/llvm/test/Transforms/LowerMatrixIntrinsics/remarks-inlining.ll
+++ b/llvm/test/Transforms/LowerMatrixIntrinsics/remarks-inlining.ll
@@ -92,10 +92,10 @@
 ; CHECK-LABEL: remark: transpose.h:13:11: Lowered with 0 stores, 0 loads, 8 compute ops
 ; CHECK-NEXT: transpose.1x2.float(transpose.2x1.float(addr %D))
 
-define void @toplevel(<15 x double>* %A, <15 x double>* %B, <15 x double>* %C, <2 x float>* %D) !dbg !16 {
+define void @toplevel(<15 x double>* %A, double* %B, <15 x double>* %C, <2 x float>* %D) !dbg !16 {
 entry:
   %a = load <15 x double>, <15 x double> *%A, align 16, !dbg !3791
-  %b = call <15 x double> @llvm.matrix.column.major.load(<15 x double>* %B, i64 5, i1 false, i32 3, i32 5), !dbg !3793
+  %b = call <15 x double> @llvm.matrix.column.major.load(double* %B, i64 5, i1 false, i32 3, i32 5), !dbg !3793
   %c = fadd <15 x double> %a, %b, !dbg !100
 
   store <15 x double> %c, <15 x double> *%C, align 16, !dbg !102
@@ -106,7 +106,7 @@
   ret void
 }
 
-declare <15 x double> @llvm.matrix.column.major.load(<15 x double>*, i64, i1, i32, i32)
+declare <15 x double> @llvm.matrix.column.major.load(double*, i64, i1, i32, i32)
 declare <2 x float> @llvm.matrix.transpose(<2 x float>, i32, i32)
 
 !llvm.dbg.cu = !{!0}
diff --git a/llvm/test/Transforms/LowerMatrixIntrinsics/remarks.ll b/llvm/test/Transforms/LowerMatrixIntrinsics/remarks.ll
--- a/llvm/test/Transforms/LowerMatrixIntrinsics/remarks.ll
+++ b/llvm/test/Transforms/LowerMatrixIntrinsics/remarks.ll
@@ -15,9 +15,6 @@
   ret void
 }
 
-declare <12 x double> @llvm.matrix.transpose.v12f64.v12f64(<12 x double>, i32, i32)
-
-
 ; CHECK-LABEL: remark: test.h:50:20: Lowered with 2 stores, 12 loads, 22 compute ops
 ; CHECK-NEXT: store(
 ; CHECK-NEXT: multiply.2x6.6x2.double(
@@ -32,33 +29,27 @@
   ret void
 }
 
-declare <4 x double> @llvm.matrix.multiply(<12 x double>, <12 x double>, i32, i32, i32)
-
 ; CHECK-LABEL: remark: test.h:60:20: Lowered with 6 stores, 6 loads, 0 compute ops
 ; CHECK-NEXT: store(
 ; CHECK-NEXT: column.major.load.3x3.double(addr %A, 5),
 ; CHECK-NEXT: addr %B)
-define void @column.major.load(<9 x double>* %A, <9 x double>* %B) !dbg !27 {
-  %A.matrix = call <9 x double> @llvm.matrix.column.major.load(<9 x double>* %A, i64 5, i1 false, i32 3, i32 3), !dbg !28
+define void @column.major.load(double* %A, <9 x double>* %B) !dbg !27 {
+  %A.matrix = call <9 x double> @llvm.matrix.column.major.load(double* %A, i64 5, i1 false, i32 3, i32 3), !dbg !28
   store <9 x double> %A.matrix, <9 x double>* %B, !dbg !28
   ret void
 }
 
-declare <9 x double> @llvm.matrix.column.major.load(<9 x double>*, i64, i1, i32, i32)
-
 ; CHECK-LABEL: remark: test.h:70:20: Lowered with 6 stores, 6 loads, 0 compute ops
 ; CHECK-NEXT: column.major.store.3x3.double(
 ; CHECK-NEXT: column.major.load.3x3.double(addr %A, 5),
 ; CHECK-NEXT: addr %B,
 ; CHECK-NEXT: 10)
-define void @column.major.store(<9 x double>* %A, <9 x double>* %B) !dbg !29 {
-  %A.matrix = call <9 x double> @llvm.matrix.column.major.load(<9 x double>* %A, i64 5, i1 false, i32 3, i32 3), !dbg !30
-  call void @llvm.matrix.column.major.store(<9 x double> %A.matrix, <9 x double>* %B, i64 10, i1 false, i32 3, i32 3), !dbg !30
+define void @column.major.store(double* %A, double* %B) !dbg !29 {
+  %A.matrix = call <9 x double> @llvm.matrix.column.major.load(double* %A, i64 5, i1 false, i32 3, i32 3), !dbg !30
+  call void @llvm.matrix.column.major.store(<9 x double> %A.matrix, double* %B, i64 10, i1 false, i32 3, i32 3), !dbg !30
   ret void
 }
 
-declare void @llvm.matrix.column.major.store(<9 x double>, <9 x double>*, i64, i1, i32, i32)
-
 ; CHECK-LABEL: remark: test.h:80:20: Lowered with 6 stores, 6 loads, 12 compute ops
 ; CHECK-NEXT: column.major.store.3x3.double(
 ; CHECK-NEXT: fmul(
@@ -69,11 +60,11 @@
 ; CHECK-NEXT: addr %B,
 ; CHECK-NEXT: 10)
 
-define void @binaryops(<9 x double>* %A, <9 x double>* %B) !dbg !31 {
-  %A.matrix = call <9 x double> @llvm.matrix.column.major.load(<9 x double>* %A, i64 5, i1 false, i32 3, i32 3), !dbg !32
+define void @binaryops(double* %A, double* %B) !dbg !31 {
+  %A.matrix = call <9 x double> @llvm.matrix.column.major.load(double* %A, i64 5, i1 false, i32 3, i32 3), !dbg !32
   %R1.matrix = fadd <9 x double> %A.matrix, %A.matrix, !dbg !32
   %R2.matrix = fmul <9 x double> %R1.matrix, %A.matrix, !dbg !32
-  call void @llvm.matrix.column.major.store(<9 x double> %R2.matrix, <9 x double>* %B, i64 10, i1 false, i32 3, i32 3), !dbg !32
+  call void @llvm.matrix.column.major.store(<9 x double> %R2.matrix, double* %B, i64 10, i1 false, i32 3, i32 3), !dbg !32
   ret void
 }
 
@@ -93,11 +84,11 @@
 ; CHECK-NEXT: load(addr %D)),
 ; CHECK-NEXT: addr %E)
 
-define void @multiple_expressions(<9 x double>* %A, <9 x double>* %B, <12 x double>* %C, <12 x double>* %D, <4 x double>* %E) !dbg !33 {
-  %A.matrix = call <9 x double> @llvm.matrix.column.major.load(<9 x double>* %A, i64 5, i1 false, i32 3, i32 3), !dbg !34
+define void @multiple_expressions(double* %A, double* %B, <12 x double>* %C, <12 x double>* %D, <4 x double>* %E) !dbg !33 {
+  %A.matrix = call <9 x double> @llvm.matrix.column.major.load(double* %A, i64 5, i1 false, i32 3, i32 3), !dbg !34
   %R1.matrix = fadd <9 x double> %A.matrix, %A.matrix, !dbg !34
   %R2.matrix = fmul <9 x double> %R1.matrix, %A.matrix, !dbg !34
-  call void @llvm.matrix.column.major.store(<9 x double> %R2.matrix, <9 x double>* %B, i64 10, i1 false, i32 3, i32 3), !dbg !34
+  call void @llvm.matrix.column.major.store(<9 x double> %R2.matrix, double* %B, i64 10, i1 false, i32 3, i32 3), !dbg !34
 
   %C.matrix = load <12 x double>, <12 x double>* %C, !dbg !34
   %D.matrix = load <12 x double>, <12 x double>* %D, !dbg !34
@@ -114,14 +105,13 @@
 ; CHECK-NEXT: column.major.load.3x3.double(addr %A, 5)
 ; CHECK-NEXT: (reused) column.major.load.3x3.double(addr %A, 5)),
 ; CHECK-NEXT: (reused) column.major.load.3x3.double(addr %A, 5)),
-; CHECK-NEXT: stack addr %B,
+; CHECK-NEXT: addr %B,
 ; CHECK-NEXT: 10)
-define void @stackaddresses(<9 x double>* %A) !dbg !35 {
-  %B = alloca <9 x double>
-  %A.matrix = call <9 x double> @llvm.matrix.column.major.load(<9 x double>* %A, i64 5, i1 false, i32 3, i32 3), !dbg !36
+define void @stackaddresses(double* %A, double* %B) !dbg !35 {
+  %A.matrix = call <9 x double> @llvm.matrix.column.major.load(double* %A, i64 5, i1 false, i32 3, i32 3), !dbg !36
   %R1.matrix = fadd <9 x double> %A.matrix, %A.matrix, !dbg !36
   %R2.matrix = fmul <9 x double> %R1.matrix, %A.matrix, !dbg !36
-  call void @llvm.matrix.column.major.store(<9 x double> %R2.matrix, <9 x double>* %B, i64 10, i1 false, i32 3, i32 3), !dbg !36
+  call void @llvm.matrix.column.major.store(<9 x double> %R2.matrix, double* %B, i64 10, i1 false, i32 3, i32 3), !dbg !36
   ret void
 }
 
@@ -146,7 +136,12 @@
   ret void
 }
 
+declare <12 x double> @llvm.matrix.transpose.v12f64.v12f64(<12 x double>, i32, i32)
+declare <4 x double> @llvm.matrix.multiply(<12 x double>, <12 x double>, i32, i32, i32)
+declare <9 x double> @llvm.matrix.column.major.load(double*, i64, i1, i32, i32)
 declare <15 x double> @llvm.matrix.transpose.v15f64.v15f64(<15 x double>, i32, i32)
+declare void @llvm.matrix.column.major.store(<9 x double>, double*, i64, i1, i32, i32)
+
 
 !llvm.dbg.cu = !{!0}
 !llvm.module.flags = !{!3, !4}
diff --git a/llvm/test/Transforms/LowerMatrixIntrinsics/strided-load-double.ll b/llvm/test/Transforms/LowerMatrixIntrinsics/strided-load-double.ll
--- a/llvm/test/Transforms/LowerMatrixIntrinsics/strided-load-double.ll
+++ b/llvm/test/Transforms/LowerMatrixIntrinsics/strided-load-double.ll
@@ -2,20 +2,19 @@
 ; RUN: opt -lower-matrix-intrinsics -S < %s | FileCheck %s
 ; RUN: opt -passes='lower-matrix-intrinsics' -S < %s | FileCheck %s
 
-define <9 x double> @strided_load_3x3(<9 x double>* %in, i64 %stride) {
+define <9 x double> @strided_load_3x3(double* %in, i64 %stride) {
 ; CHECK-LABEL: @strided_load_3x3(
 ; CHECK-NEXT: entry:
-; CHECK-NEXT: [[TMP0:%.*]] = bitcast <9 x double>* [[IN:%.*]] to double*
 ; CHECK-NEXT: [[VEC_START:%.*]] = mul i64 0, [[STRIDE:%.*]]
-; CHECK-NEXT: [[VEC_GEP:%.*]] = getelementptr double, double* [[TMP0]], i64 [[VEC_START]]
+; CHECK-NEXT: [[VEC_GEP:%.*]] = getelementptr double, double* %in, i64 [[VEC_START]]
 ; CHECK-NEXT: [[VEC_CAST:%.*]] = bitcast double* [[VEC_GEP]] to <3 x double>*
 ; CHECK-NEXT: [[COL_LOAD:%.*]] = load <3 x double>, <3 x double>* [[VEC_CAST]], align 8
 ; CHECK-NEXT: [[VEC_START1:%.*]] = mul i64 1, [[STRIDE]]
-; CHECK-NEXT: [[VEC_GEP2:%.*]] = getelementptr double, double* [[TMP0]], i64 [[VEC_START1]]
+; CHECK-NEXT: [[VEC_GEP2:%.*]] = getelementptr double, double* %in, i64 [[VEC_START1]]
 ; CHECK-NEXT: [[VEC_CAST3:%.*]] = bitcast double* [[VEC_GEP2]] to <3 x double>*
 ; CHECK-NEXT: [[COL_LOAD4:%.*]] = load <3 x double>, <3 x double>* [[VEC_CAST3]], align 8
 ; CHECK-NEXT: [[VEC_START5:%.*]] = mul i64 2, [[STRIDE]]
-; CHECK-NEXT: [[VEC_GEP6:%.*]] = getelementptr double, double* [[TMP0]], i64 [[VEC_START5]]
+; CHECK-NEXT: [[VEC_GEP6:%.*]] = getelementptr double, double* %in, i64 [[VEC_START5]]
 ; CHECK-NEXT: [[VEC_CAST7:%.*]] = bitcast double* [[VEC_GEP6]] to <3 x double>*
 ; CHECK-NEXT: [[COL_LOAD8:%.*]] = load <3 x double>, <3 x double>* [[VEC_CAST7]], align 8
 ; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <3 x double> [[COL_LOAD]], <3 x double> [[COL_LOAD4]], <6 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5>
@@ -24,51 +23,47 @@
 ; CHECK-NEXT: ret <9 x double> [[TMP3]]
 ;
 entry:
-  %load = call <9 x double> @llvm.matrix.column.major.load(<9 x double>* %in, i64 %stride, i1 false, i32 3, i32 3)
+  %load = call <9 x double> @llvm.matrix.column.major.load(double* %in, i64 %stride, i1 false, i32 3, i32 3)
   ret <9 x double> %load
 }
 
-declare <9 x double> @llvm.matrix.column.major.load(<9 x double>*, i64, i1, i32, i32)
+declare <9 x double> @llvm.matrix.column.major.load(double*, i64, i1, i32, i32)
-define <9 x double> @strided_load_9x1(<9 x double>* %in, i64 %stride) {
+define <9 x double> @strided_load_9x1(double* %in, i64 %stride) {
 ; CHECK-LABEL: @strided_load_9x1(
 ; CHECK-NEXT: entry:
-; CHECK-NEXT: [[TMP0:%.*]] = bitcast <9 x double>* [[IN:%.*]] to double*
 ; CHECK-NEXT: [[VEC_START:%.*]] = mul i64 0, [[STRIDE:%.*]]
-; CHECK-NEXT: [[VEC_GEP:%.*]] = getelementptr double, double* [[TMP0]], i64 [[VEC_START]]
+; CHECK-NEXT: [[VEC_GEP:%.*]] = getelementptr double, double* %in, i64 [[VEC_START]]
 ; CHECK-NEXT: [[VEC_CAST:%.*]] = bitcast double* [[VEC_GEP]] to <9 x double>*
 ; CHECK-NEXT: [[COL_LOAD:%.*]] = load <9 x double>, <9 x double>* [[VEC_CAST]], align 8
 ; CHECK-NEXT: ret <9 x double> [[COL_LOAD]]
 ;
 entry:
-  %load = call <9 x double> @llvm.matrix.column.major.load(<9 x double>* %in, i64 %stride, i1 false, i32 9, i32 1)
+  %load = call <9 x double> @llvm.matrix.column.major.load(double* %in, i64 %stride, i1 false, i32 9, i32 1)
   ret <9 x double> %load
 }
 
-declare <8 x double> @llvm.matrix.column.major.load.v8f64(<8 x double>*, i64, i1, i32, i32)
+declare <8 x double> @llvm.matrix.column.major.load.v8f64(double*, i64, i1, i32, i32)
+; CHECK: declare <8 x double> @llvm.matrix.column.major.load.v8f64(double* nocapture, i64, i1 immarg, i32 immarg, i32 immarg) [[READONLY:#[0-9]]]
 
-define <8 x double> @strided_load_4x2(<8 x double>* %in, i64 %stride) {
+define <8 x double> @strided_load_4x2(double* %in, i64 %stride) {
 ; CHECK-LABEL: @strided_load_4x2(
 ; CHECK-NEXT: entry:
-; CHECK-NEXT: [[TMP0:%.*]] = bitcast <8 x double>* [[IN:%.*]] to double*
 ; CHECK-NEXT: [[VEC_START:%.*]] = mul i64 0, [[STRIDE:%.*]]
-; CHECK-NEXT: [[VEC_GEP:%.*]] = getelementptr double, double* [[TMP0]], i64 [[VEC_START]]
+; CHECK-NEXT: [[VEC_GEP:%.*]] = getelementptr double, double* %in, i64 [[VEC_START]]
 ; CHECK-NEXT: [[VEC_CAST:%.*]] = bitcast double* [[VEC_GEP]] to <4 x double>*
 ; CHECK-NEXT: [[COL_LOAD:%.*]] = load <4 x double>, <4 x double>* [[VEC_CAST]], align 8
 ; CHECK-NEXT: [[VEC_START1:%.*]] = mul i64 1, [[STRIDE]]
-; CHECK-NEXT: [[VEC_GEP2:%.*]] = getelementptr double, double* [[TMP0]], i64 [[VEC_START1]]
+; CHECK-NEXT: [[VEC_GEP2:%.*]] = getelementptr double, double* %in, i64 [[VEC_START1]]
 ; CHECK-NEXT: [[VEC_CAST3:%.*]] = bitcast double* [[VEC_GEP2]] to <4 x double>*
 ; CHECK-NEXT: [[COL_LOAD4:%.*]] = load <4 x double>, <4 x double>* [[VEC_CAST3]], align 8
 ; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <4 x double> [[COL_LOAD]], <4 x double> [[COL_LOAD4]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
 ; CHECK-NEXT: ret <8 x double> [[TMP1]]
 ;
 entry:
-  %load = call <8 x double> @llvm.matrix.column.major.load.v8f64(<8 x double>* %in, i64 %stride, i1 false, i32 4, i32 2)
+  %load = call <8 x double> @llvm.matrix.column.major.load.v8f64(double* %in, i64 %stride, i1 false, i32 4, i32 2)
   ret <8 x double> %load
 }
 
-; CHECK: declare <9 x double> @llvm.matrix.column.major.load.v9f64.p0v9f64(<9 x double>* nocapture, i64, i1 immarg, i32 immarg, i32 immarg) [[READONLY:#[0-9]]]
-
-; CHECK: declare <8 x double> @llvm.matrix.column.major.load.v8f64.p0v8f64(<8 x double>* nocapture, i64, i1 immarg, i32 immarg, i32 immarg) [[READONLY]]
-
+; CHECK: declare <9 x double> @llvm.matrix.column.major.load.v9f64(double* nocapture, i64, i1 immarg, i32 immarg, i32 immarg) [[READONLY]]
 ; CHECK: attributes [[READONLY]] = { argmemonly nosync nounwind readonly willreturn }
diff --git a/llvm/test/Transforms/LowerMatrixIntrinsics/strided-load-float.ll b/llvm/test/Transforms/LowerMatrixIntrinsics/strided-load-float.ll
--- a/llvm/test/Transforms/LowerMatrixIntrinsics/strided-load-float.ll
+++ b/llvm/test/Transforms/LowerMatrixIntrinsics/strided-load-float.ll
@@ -2,20 +2,19 @@
 ; RUN: opt -lower-matrix-intrinsics -S < %s | FileCheck %s
 ; RUN: opt -passes='lower-matrix-intrinsics' -S < %s | FileCheck %s
 
-define <9 x float> @strided_load_3x3(<9 x float>* %in, i64 %stride) {
+define <9 x float> @strided_load_3x3(float* %in, i64 %stride) {
 ; CHECK-LABEL: @strided_load_3x3(
 ; CHECK-NEXT: entry:
-; CHECK-NEXT: [[TMP0:%.*]] = bitcast <9 x float>* [[IN:%.*]] to float*
 ; CHECK-NEXT: [[VEC_START:%.*]] = mul i64 0, [[STRIDE:%.*]]
-; CHECK-NEXT: [[VEC_GEP:%.*]] = getelementptr float, float* [[TMP0]], i64 [[VEC_START]]
+; CHECK-NEXT: [[VEC_GEP:%.*]] = getelementptr float, float* %in, i64 [[VEC_START]]
 ; CHECK-NEXT: [[VEC_CAST:%.*]] = bitcast float* [[VEC_GEP]] to <3 x float>*
 ; CHECK-NEXT: [[COL_LOAD:%.*]] = load <3 x float>, <3 x float>* [[VEC_CAST]], align 4
 ; CHECK-NEXT: [[VEC_START1:%.*]] = mul i64 1, [[STRIDE]]
-; CHECK-NEXT: [[VEC_GEP2:%.*]] = getelementptr float, float* [[TMP0]], i64 [[VEC_START1]]
+; CHECK-NEXT: [[VEC_GEP2:%.*]] = getelementptr float, float* %in, i64 [[VEC_START1]]
 ; CHECK-NEXT: [[VEC_CAST3:%.*]] = bitcast float* [[VEC_GEP2]] to <3 x float>*
 ; CHECK-NEXT: [[COL_LOAD4:%.*]] = load <3 x float>, <3 x float>* [[VEC_CAST3]], align 4
 ; CHECK-NEXT: [[VEC_START5:%.*]] = mul i64 2, [[STRIDE]]
-; CHECK-NEXT: [[VEC_GEP6:%.*]] = getelementptr float, float* [[TMP0]], i64 [[VEC_START5]]
+; CHECK-NEXT: [[VEC_GEP6:%.*]] = getelementptr float, float* %in, i64 [[VEC_START5]]
 ; CHECK-NEXT: [[VEC_CAST7:%.*]] = bitcast float* [[VEC_GEP6]] to <3 x float>*
 ; CHECK-NEXT: [[COL_LOAD8:%.*]] = load <3 x float>, <3 x float>* [[VEC_CAST7]], align 4
 ; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <3 x float> [[COL_LOAD]], <3 x float> [[COL_LOAD4]], <6 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5>
@@ -24,45 +23,43 @@
 ; CHECK-NEXT: ret <9 x float> [[TMP3]]
 ;
 entry:
-  %load = call <9 x float> @llvm.matrix.column.major.load(<9 x float>* %in, i64 %stride, i1 false, i32 3, i32 3)
+  %load = call <9 x float> @llvm.matrix.column.major.load(float* %in, i64 %stride, i1 false, i32 3, i32 3)
   ret <9 x float> %load
 }
 
-declare <9 x float> @llvm.matrix.column.major.load(<9 x float>*, i64, i1, i32, i32)
+declare <9 x float> @llvm.matrix.column.major.load(float*, i64, i1, i32, i32)
 
-define <9 x float> @strided_load_9x1(<9 x float>* %in, i64 %stride) {
+define <9 x float> @strided_load_9x1(float* %in, i64 %stride) {
 ; CHECK-LABEL: @strided_load_9x1(
 ; CHECK-NEXT: entry:
-; CHECK-NEXT: [[TMP0:%.*]] = bitcast <9 x float>* [[IN:%.*]] to float*
 ; CHECK-NEXT: [[VEC_START:%.*]] = mul i64 0, [[STRIDE:%.*]]
-; CHECK-NEXT: [[VEC_GEP:%.*]] = getelementptr float, float* [[TMP0]], i64 [[VEC_START]]
+; CHECK-NEXT: [[VEC_GEP:%.*]] = getelementptr float, float* %in, i64 [[VEC_START]]
 ; CHECK-NEXT: [[VEC_CAST:%.*]] = bitcast float* [[VEC_GEP]] to <9 x float>*
 ; CHECK-NEXT: [[COL_LOAD:%.*]] = load <9 x float>, <9 x float>* [[VEC_CAST]], align 4
 ; CHECK-NEXT: ret <9 x float> [[COL_LOAD]]
 ;
 entry:
-  %load = call <9 x float> @llvm.matrix.column.major.load(<9 x float>* %in, i64 %stride, i1 false, i32 9, i32 1)
+  %load = call <9 x float> @llvm.matrix.column.major.load(float* %in, i64 %stride, i1 false, i32 9, i32 1)
   ret <9 x float> %load
 }
 
-declare <8 x float> @llvm.matrix.column.major.load.v8f32(<8 x float>*, i64, i1, i32, i32)
+declare <8 x float> @llvm.matrix.column.major.load.v8f32(float*, i64, i1, i32, i32)
 
-define <8 x float> @strided_load_4x2(<8 x float>* %in, i64 %stride) {
+define <8 x float> @strided_load_4x2(float* %in, i64 %stride) {
 ; CHECK-LABEL: @strided_load_4x2(
 ; CHECK-NEXT: entry:
-; CHECK-NEXT: [[TMP0:%.*]] = bitcast <8 x float>* [[IN:%.*]] to float*
 ; CHECK-NEXT: [[VEC_START:%.*]] = mul i64 0, [[STRIDE:%.*]]
-; CHECK-NEXT: [[VEC_GEP:%.*]] = getelementptr float, float* [[TMP0]], i64 [[VEC_START]]
+; CHECK-NEXT: [[VEC_GEP:%.*]] = getelementptr float, float* %in, i64 [[VEC_START]]
 ; CHECK-NEXT: [[VEC_CAST:%.*]] = bitcast float* [[VEC_GEP]] to <4 x float>*
 ; CHECK-NEXT: [[COL_LOAD:%.*]] = load <4 x float>, <4 x float>* [[VEC_CAST]], align 4
 ; CHECK-NEXT: [[VEC_START1:%.*]] = mul i64 1, [[STRIDE]]
-; CHECK-NEXT: [[VEC_GEP2:%.*]] = getelementptr float, float* [[TMP0]], i64 [[VEC_START1]]
+; CHECK-NEXT: [[VEC_GEP2:%.*]] = getelementptr float, float* %in, i64 [[VEC_START1]]
 ; CHECK-NEXT: [[VEC_CAST3:%.*]] = bitcast float* [[VEC_GEP2]] to <4 x float>*
 ; CHECK-NEXT: [[COL_LOAD4:%.*]] = load <4 x float>, <4 x float>* [[VEC_CAST3]], align 4
 ; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <4 x float> [[COL_LOAD]], <4 x float> [[COL_LOAD4]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
 ; CHECK-NEXT: ret <8 x float> [[TMP1]]
 ;
 entry:
-  %load = call <8 x float> @llvm.matrix.column.major.load.v8f32(<8 x float>* %in, i64 %stride, i1 false, i32 4, i32 2)
+  %load = call <8 x float> @llvm.matrix.column.major.load.v8f32(float* %in, i64 %stride, i1 false, i32 4, i32 2)
   ret <8 x float> %load
 }
diff --git a/llvm/test/Transforms/LowerMatrixIntrinsics/strided-load-i32.ll b/llvm/test/Transforms/LowerMatrixIntrinsics/strided-load-i32.ll
--- a/llvm/test/Transforms/LowerMatrixIntrinsics/strided-load-i32.ll
+++ b/llvm/test/Transforms/LowerMatrixIntrinsics/strided-load-i32.ll
@@ -2,20 +2,19 @@
 ; RUN: opt -lower-matrix-intrinsics -S < %s | FileCheck %s
 ; RUN: opt -passes='lower-matrix-intrinsics' -S < %s | FileCheck %s
 
-define <9 x i32> @strided_load_3x3(<9 x i32>* %in, i64 %stride) {
+define <9 x i32> @strided_load_3x3(i32* %in, i64 %stride) {
 ; CHECK-LABEL: @strided_load_3x3(
 ; CHECK-NEXT: entry:
-; CHECK-NEXT: [[TMP0:%.*]] = bitcast <9 x i32>* [[IN:%.*]] to i32*
 ; CHECK-NEXT: [[VEC_START:%.*]] = mul i64 0, [[STRIDE:%.*]]
-; CHECK-NEXT: [[VEC_GEP:%.*]] = getelementptr i32, i32* [[TMP0]], i64 [[VEC_START]]
+; CHECK-NEXT: [[VEC_GEP:%.*]] = getelementptr i32, i32* %in, i64 [[VEC_START]]
 ; CHECK-NEXT: [[VEC_CAST:%.*]] = bitcast i32* [[VEC_GEP]] to <3 x i32>*
 ; CHECK-NEXT: [[COL_LOAD:%.*]] = load <3 x i32>, <3 x i32>* [[VEC_CAST]], align 4
 ; CHECK-NEXT: [[VEC_START1:%.*]] = mul i64 1, [[STRIDE]]
-; CHECK-NEXT: [[VEC_GEP2:%.*]] = getelementptr i32, i32* [[TMP0]], i64 [[VEC_START1]]
+; CHECK-NEXT: [[VEC_GEP2:%.*]] = getelementptr i32, i32* %in, i64 [[VEC_START1]]
 ; CHECK-NEXT: [[VEC_CAST3:%.*]] = bitcast i32* [[VEC_GEP2]] to <3 x i32>*
 ; CHECK-NEXT: [[COL_LOAD4:%.*]] = load <3 x i32>, <3 x i32>* [[VEC_CAST3]], align 4
 ; CHECK-NEXT: [[VEC_START5:%.*]] = mul i64 2, [[STRIDE]]
-; CHECK-NEXT: [[VEC_GEP6:%.*]] = getelementptr i32, i32* [[TMP0]], i64 [[VEC_START5]]
+; CHECK-NEXT: [[VEC_GEP6:%.*]] = getelementptr i32, i32* %in, i64 [[VEC_START5]]
 ; CHECK-NEXT: [[VEC_CAST7:%.*]] = bitcast i32* [[VEC_GEP6]] to <3 x i32>*
 ; CHECK-NEXT: [[COL_LOAD8:%.*]] = load <3 x i32>, <3 x i32>* [[VEC_CAST7]], align 4
 ; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <3 x i32> [[COL_LOAD]], <3 x i32> [[COL_LOAD4]], <6 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5>
@@ -24,45 +23,43 @@
 ; CHECK-NEXT: ret <9 x i32> [[TMP3]]
 ;
 entry:
-  %load = call <9 x i32> @llvm.matrix.column.major.load(<9 x i32>* %in, i64 %stride, i1 false, i32 3, i32 3)
+  %load = call <9 x i32> @llvm.matrix.column.major.load(i32* %in, i64 %stride, i1 false, i32 3, i32 3)
   ret <9 x i32> %load
 }
 
-declare <9 x i32> @llvm.matrix.column.major.load(<9 x i32>*, i64, i1, i32, i32)
+declare <9 x i32> @llvm.matrix.column.major.load(i32*, i64, i1, i32, i32)
 
-define <9 x i32> @strided_load_9x1(<9 x i32>* %in, i64 %stride) {
+define <9 x i32> @strided_load_9x1(i32* %in, i64 %stride) {
 ; CHECK-LABEL: @strided_load_9x1(
 ; CHECK-NEXT: entry:
-; CHECK-NEXT: [[TMP0:%.*]] = bitcast <9 x i32>* [[IN:%.*]] to i32*
 ; CHECK-NEXT: [[VEC_START:%.*]] = mul i64 0, [[STRIDE:%.*]]
-; CHECK-NEXT: [[VEC_GEP:%.*]] = getelementptr i32, i32* [[TMP0]], i64 [[VEC_START]]
+; CHECK-NEXT: [[VEC_GEP:%.*]] = getelementptr i32, i32* %in, i64 [[VEC_START]]
 ; CHECK-NEXT: [[VEC_CAST:%.*]] = bitcast i32* [[VEC_GEP]] to <9 x i32>*
 ; CHECK-NEXT: [[COL_LOAD:%.*]] = load <9 x i32>, <9 x i32>* [[VEC_CAST]], align 4
 ; CHECK-NEXT: ret <9 x i32> [[COL_LOAD]]
 ;
 entry:
-  %load = call <9 x i32> @llvm.matrix.column.major.load(<9 x i32>* %in, i64 %stride, i1 false, i32 9, i32 1)
+  %load = call <9 x i32> @llvm.matrix.column.major.load(i32* %in, i64 %stride, i1 false, i32 9, i32 1)
   ret <9 x i32> %load
 }
 
-declare <8 x i32> @llvm.matrix.column.major.load.v8i32(<8 x i32>*, i64, i1, i32, i32)
+declare <8 x i32> @llvm.matrix.column.major.load.v8i32(i32*, i64, i1, i32, i32)
 
-define <8 x i32> @strided_load_4x2(<8 x i32>* %in, i64 %stride) {
+define <8 x i32> @strided_load_4x2(i32* %in, i64 %stride) {
 ; CHECK-LABEL: @strided_load_4x2(
 ; CHECK-NEXT: entry:
-; CHECK-NEXT: [[TMP0:%.*]] = bitcast <8 x i32>* [[IN:%.*]] to i32*
 ; CHECK-NEXT: [[VEC_START:%.*]] = mul i64 0, [[STRIDE:%.*]]
-; CHECK-NEXT: [[VEC_GEP:%.*]] = getelementptr i32, i32* [[TMP0]], i64 [[VEC_START]]
+; CHECK-NEXT: [[VEC_GEP:%.*]] = getelementptr i32, i32* %in, i64 [[VEC_START]]
 ; CHECK-NEXT: [[VEC_CAST:%.*]] = bitcast i32* [[VEC_GEP]] to <4 x i32>*
 ; CHECK-NEXT: [[COL_LOAD:%.*]] = load <4 x i32>, <4 x i32>* [[VEC_CAST]], align 4
 ; CHECK-NEXT: [[VEC_START1:%.*]] = mul i64 1, [[STRIDE]]
-; CHECK-NEXT: [[VEC_GEP2:%.*]] = getelementptr i32, i32* [[TMP0]], i64 [[VEC_START1]]
+; CHECK-NEXT: [[VEC_GEP2:%.*]] = getelementptr i32, i32* %in, i64 [[VEC_START1]]
 ; CHECK-NEXT: [[VEC_CAST3:%.*]] = bitcast i32* [[VEC_GEP2]] to <4 x i32>*
 ; CHECK-NEXT: [[COL_LOAD4:%.*]] = load <4 x i32>, <4 x i32>* [[VEC_CAST3]], align 4
 ; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <4 x i32> [[COL_LOAD]], <4 x i32> [[COL_LOAD4]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
 ; CHECK-NEXT: ret <8 x i32> [[TMP1]]
 ;
 entry:
-  %load = call <8 x i32> @llvm.matrix.column.major.load.v8i32(<8 x i32>* %in, i64 %stride, i1 false, i32 4, i32 2)
+  %load = call <8 x i32> @llvm.matrix.column.major.load.v8i32(i32* %in, i64 %stride, i1 false, i32 4, i32 2)
   ret <8 x i32> %load
 }
diff --git a/llvm/test/Transforms/LowerMatrixIntrinsics/strided-store-double.ll b/llvm/test/Transforms/LowerMatrixIntrinsics/strided-store-double.ll
--- a/llvm/test/Transforms/LowerMatrixIntrinsics/strided-store-double.ll
+++ b/llvm/test/Transforms/LowerMatrixIntrinsics/strided-store-double.ll
@@ -13,7 +13,7 @@
 ; CHECK-NEXT: store <3 x double> [[SPLIT1]], <3 x double>* [[VEC_CAST2]], align 8
 ; CHECK-NEXT: ret void
 ;
-  call void @llvm.matrix.column.major.store(<6 x double> %in, double* %out, i64 5, i1 false, i32 3, i32 2)
+  call void @llvm.matrix.column.major.store.v6f64(<6 x double> %in, double* %out, i64 5, i1 false, i32 3, i32 2)
   ret void
 }
 
@@ -31,13 +31,10 @@
 ; CHECK-NEXT: store <3 x double> [[SPLIT1]], <3 x double>* [[VEC_CAST4]], align 8
 ; CHECK-NEXT: ret void
 ;
-  call void @llvm.matrix.column.major.store(<6 x double> %in, double* %out, i64 %stride, i1 false, i32 3, i32 2)
+  call void @llvm.matrix.column.major.store.v6f64(<6 x double> %in, double* %out, i64 %stride, i1 false, i32 3, i32 2)
   ret void
 }
 
-
-declare void @llvm.matrix.column.major.store(<6 x double>, double*, i64, i1, i32, i32)
-
 define void @strided_store_2x3(<10 x double> %in, double* %out) {
 ; CHECK-LABEL: @strided_store_2x3(
 ; CHECK-NEXT: [[SPLIT:%.*]] = shufflevector <10 x double> [[IN:%.*]], <10 x double> undef, <2 x i32> <i32 0, i32 1>
@@ -65,10 +62,9 @@
   ret void
 }
 
+declare void @llvm.matrix.column.major.store.v6f64(<6 x double>, double*, i64, i1, i32, i32)
 declare void @llvm.matrix.column.major.store.v10f64(<10 x double>, double*, i64, i1, i32, i32)
 
-; CHECK: declare void @llvm.matrix.column.major.store.v6f64.p0f64(<6 x double>, double* nocapture writeonly, i64, i1 immarg, i32 immarg, i32 immarg) [[WRITEONLY:#[0-9]]]
-
-; CHECK: declare void @llvm.matrix.column.major.store.v10f64.p0f64(<10 x double>, double* nocapture writeonly, i64, i1 immarg, i32 immarg, i32 immarg) [[WRITEONLY]]
-
-; CHECK: attributes [[WRITEONLY]] = { argmemonly nosync nounwind willreturn writeonly }
+; CHECK: declare void @llvm.matrix.column.major.store.v6f64(<6 x double>, double* nocapture writeonly, i64, i1 immarg, i32 immarg, i32 immarg) #0
+; CHECK: declare void @llvm.matrix.column.major.store.v10f64(<10 x double>, double* nocapture writeonly, i64, i1 immarg, i32 immarg, i32 immarg) #0
+; CHECK: attributes #0 = { argmemonly nosync nounwind willreturn writeonly }
diff --git a/llvm/test/Verifier/matrix-intrinsics.ll b/llvm/test/Verifier/matrix-intrinsics.ll
--- a/llvm/test/Verifier/matrix-intrinsics.ll
+++ b/llvm/test/Verifier/matrix-intrinsics.ll
@@ -1,11 +1,10 @@
 ; RUN: not llvm-as < %s -o /dev/null 2>&1 | FileCheck %s
 
-declare <4 x float> @llvm.matrix.transpose.v4f32(<4 x float>, i32, i32)
 define <4 x float> @transpose(<4 x float> %m, i32 %arg) {
 ; CHECK: assembly parsed, but does not verify as correct!
-; CHECK-NEXT: result of a matrix operation does not fit in the returned vector
-; CHECK-NEXT: result of a matrix operation does not fit in the returned vector
-; CHECK-NEXT: result of a matrix operation does not fit in the returned vector
+; CHECK-NEXT: Result of a matrix operation does not fit in the returned vector!
+; CHECK-NEXT: Result of a matrix operation does not fit in the returned vector!
+; CHECK-NEXT: Result of a matrix operation does not fit in the returned vector!
 ; CHECK-NEXT: immarg operand has non-immediate parameter
 ; CHECK-NEXT: i32 %arg
 ; CHECK-NEXT: %result.3 = call <4 x float> @llvm.matrix.transpose.v4f32(<4 x float> %result.2, i32 %arg, i32 2)
@@ -20,11 +19,10 @@
   ret <4 x float> %result.4
 }
 
-declare <4 x float> @llvm.matrix.multiply.v4f32.v4f32.v4f32(<4 x float>, <4 x float>, i32, i32, i32)
 define <4 x float> @multiply(<4 x float> %m, i32 %arg) {
-; CHECK-NEXT: result of a matrix operation does not fit in the returned vector
-; CHECK-NEXT: result of a matrix operation does not fit in the returned vector
-; CHECK-NEXT: result of a matrix operation does not fit in the returned vector
+; CHECK-NEXT: Result of a matrix operation does not fit in the returned vector!
+; CHECK-NEXT: Result of a matrix operation does not fit in the returned vector!
+; CHECK-NEXT: Result of a matrix operation does not fit in the returned vector!
; CHECK-NEXT: immarg operand has non-immediate parameter ; CHECK-NEXT: i32 %arg ; CHECK-NEXT: %result.3 = call <4 x float> @llvm.matrix.multiply.v4f32.v4f32.v4f32(<4 x float> %result.2, <4 x float> %m, i32 %arg, i32 2, i32 1) @@ -35,32 +33,130 @@ ret <4 x float> %result.3 } -declare <4 x float> @llvm.matrix.column.major.load.v4f32.p0v4f32(<4 x float>*, i64, i1, i32, i32) -declare <6 x float> @llvm.matrix.column.major.load.v6f32.p0v6f32(<6 x float>*, i64, i1, i32, i32) -define <4 x float> @column.major_load(<4 x float>* %m, <6 x float>* %n, i32 %arg) { -; CHECK-NEXT: result of a matrix operation does not fit in the returned vector -; CHECK-NEXT: result of a matrix operation does not fit in the returned vector -; CHECK-NEXT: result of a matrix operation does not fit in the returned vector +define <4 x float> @column.major_load(float* %m, float* %n, i32 %arg) { +; CHECK-NEXT: Result of a matrix operation does not fit in the returned vector! +; CHECK-NEXT: Result of a matrix operation does not fit in the returned vector! +; CHECK-NEXT: Result of a matrix operation does not fit in the returned vector! ; CHECK-NEXT: immarg operand has non-immediate parameter ; CHECK-NEXT: i32 %arg -; CHECK-NEXT: %result.3 = call <6 x float> @llvm.matrix.column.major.load.v6f32.p0v6f32(<6 x float>* %n, i64 2, i1 true, i32 3, i32 %arg) - %result.0 = call <4 x float> @llvm.matrix.column.major.load.v4f32.p0v4f32(<4 x float>* %m, i64 0, i1 false, i32 0, i32 0) - %result.1 = call <4 x float> @llvm.matrix.column.major.load.v4f32.p0v4f32(<4 x float>* %m, i64 2, i1 false, i32 1, i32 2) - %result.2 = call <6 x float> @llvm.matrix.column.major.load.v6f32.p0v6f32(<6 x float>* %n, i64 2, i1 true, i32 3, i32 3) - %result.3 = call <6 x float> @llvm.matrix.column.major.load.v6f32.p0v6f32(<6 x float>* %n, i64 2, i1 true, i32 3, i32 %arg) +; CHECK-NEXT: %result.3 = call <6 x float> @llvm.matrix.column.major.load.v6f32(float* %n, i64 2, i1 true, i32 3, i32 %arg) + %result.0 = call <4 x float> @llvm.matrix.column.major.load.v4f32(float* %m, i64 0, i1 false, i32 0, i32 0) + %result.1 = call <4 x float> @llvm.matrix.column.major.load.v4f32(float* %m, i64 2, i1 false, i32 1, i32 2) + %result.2 = call <6 x float> @llvm.matrix.column.major.load.v6f32(float* %n, i64 2, i1 true, i32 3, i32 3) + %result.3 = call <6 x float> @llvm.matrix.column.major.load.v6f32(float* %n, i64 2, i1 true, i32 3, i32 %arg) ret <4 x float> %result.1 } -declare void @llvm.matrix.column.major.store.v4f32.p0v4f32(<4 x float>, <4 x float>*, i64, i1, i32, i32) -declare void @llvm.matrix.column.major.store.v6f32.p0v6f32(<6 x float>, <6 x float>*, i64, i1, i32, i32) -define void @column.major_store(<4 x float>* %m, <6 x float>* %n, i64 %arg) { -; CHECK-NEXT: result of a matrix operation does not fit in the returned vector -; CHECK-NEXT: result of a matrix operation does not fit in the returned vector -; CHECK-NEXT: result of a matrix operation does not fit in the returned vector -; CHECK-NEXT: result of a matrix operation does not fit in the returned vector - call void @llvm.matrix.column.major.store.v4f32.p0v4f32(<4 x float> zeroinitializer, <4 x float>* %m, i64 0, i1 false, i32 0, i32 0) - call void @llvm.matrix.column.major.store.v4f32.p0v4f32(<4 x float> zeroinitializer, <4 x float>* %m, i64 2, i1 false, i32 1, i32 2) - call void @llvm.matrix.column.major.store.v6f32.p0v6f32(<6 x float> zeroinitializer, <6 x float>* %n, i64 2, i1 false, i32 3, i32 3) - call void @llvm.matrix.column.major.store.v6f32.p0v6f32(<6 x float> zeroinitializer, <6 x float>* %n, i64 %arg, 
i1 false, i32 3, i32 3) +define void @column.major_store(float* %m, float* %n, i64 %arg) { +; CHECK-NEXT: Result of a matrix operation does not fit in the returned vector! +; CHECK-NEXT: Result of a matrix operation does not fit in the returned vector! +; CHECK-NEXT: Result of a matrix operation does not fit in the returned vector! +; CHECK-NEXT: Result of a matrix operation does not fit in the returned vector! + call void @llvm.matrix.column.major.store.v4f32(<4 x float> zeroinitializer, float* %m, i64 0, i1 false, i32 0, i32 0) + call void @llvm.matrix.column.major.store.v4f32(<4 x float> zeroinitializer, float* %m, i64 2, i1 false, i32 1, i32 2) + call void @llvm.matrix.column.major.store.v6f32(<6 x float> zeroinitializer, float* %n, i64 2, i1 false, i32 3, i32 3) + call void @llvm.matrix.column.major.store.v6f32(<6 x float> zeroinitializer, float* %n, i64 %arg, i1 false, i32 3, i32 3) + ret void +} + +define <4 x float> @transpose_mixed_types(<4 x float> %fvec, <4 x i32> %ivec, i32 %arg) { +; +; CHECK-NEXT: Intrinsic has incorrect argument type! +; CHECK-NEXT: <4 x float> (<4 x i32>, i32, i32)* @llvm.matrix.transpose.v4f32.v4i32 +; CHECK-NEXT: Intrinsic has incorrect argument type! +; CHECK-NEXT: <4 x i32> (<4 x float>, i32, i32)* @llvm.matrix.transpose.v4i32.v4f32 +; + %result.0 = call <4 x float> @llvm.matrix.transpose.v4f32.v4i32(<4 x i32> %ivec, i32 0, i32 0) + %result.1 = call <4 x i32> @llvm.matrix.transpose.v4i32.v4f32(<4 x float> %result.0, i32 3, i32 2) + ret <4 x float> %result.0 +} + +define <4 x float> @multiply_mixed_types(<4 x i32> %ivec, <4 x float> %fvec, i32 %arg) { +; +; CHECK-NEXT: Vector element type mismatch of the result and first operand vector! +; CHECK-NEXT: <4 x i32> (<4 x float>, <4 x float>, i32, i32, i32)* @llvm.matrix.multiply.v4i32.v4f32.v4f32 +; CHECK-NEXT: Vector element type mismatch of the result and first operand vector! +; CHECK-NEXT: <4 x float> (<4 x i32>, <4 x float>, i32, i32, i32)* @llvm.matrix.multiply.v4f32.v4i32.v4f32 +; CHECK-NEXT: Vector element type mismatch of the result and second operand vector! +; CHECK-NEXT: <4 x float> (<4 x float>, <4 x i32>, i32, i32, i32)* @llvm.matrix.multiply.v4f32.v4f32.v4i32 +; CHECK-NEXT: Vector element type mismatch of the result and first operand vector! +; CHECK-NEXT: <4 x float> (<4 x i32>, <4 x i32>, i32, i32, i32)* @llvm.matrix.multiply.v4f32.v4i32.v4i32 +; + %result.0 = call <4 x i32> @llvm.matrix.multiply.v4i32.v4f32.v4f32(<4 x float> %fvec, <4 x float> %fvec, i32 2, i32 2, i32 2) + %result.1 = call <4 x float> @llvm.matrix.multiply.v4f32.v4i32.v4f32(<4 x i32> %result.0, <4 x float> %fvec, i32 2, i32 2, i32 2) + %result.2 = call <4 x float> @llvm.matrix.multiply.v4f32.v4f32.v4i32(<4 x float> %fvec, <4 x i32> %ivec, i32 2, i32 2, i32 2) + %result.3 = call <4 x float> @llvm.matrix.multiply.v4f32.v4i32.v4i32(<4 x i32> %ivec, <4 x i32> %ivec, i32 2, i32 2, i32 2) + ret <4 x float> %result.3 +} + +define <4 x float> @column.major_load_mixed_types(i32* %m, float* %n, i32 %arg) { +; +; CHECK-NEXT: Intrinsic has incorrect argument type! +; CHECK-NEXT: <4 x float> (i32*, i64, i1, i32, i32)* @llvm.matrix.column.major.load.v4f32.pi32 +; CHECK-NEXT: Intrinsic has incorrect argument type! 
+; CHECK-NEXT: <4 x i32> (float*, i64, i1, i32, i32)* @llvm.matrix.column.major.load.v4i32 +; + %result.0 = call <4 x float> @llvm.matrix.column.major.load.v4f32.pi32(i32* %m, i64 2, i1 false, i32 2, i32 2) + %result.1 = call <4 x i32> @llvm.matrix.column.major.load.v4i32(float* %n, i64 2, i1 false, i32 2, i32 2) + ret <4 x float> %result.0 +} + +define void @column.major_store_mixed_types(float* %m, i32* %n, i64 %arg) { +; +; CHECK-NEXT: Intrinsic has incorrect argument type! +; CHECK-NEXT: void (<4 x i32>, float*, i64, i1, i32, i32)* @llvm.matrix.column.major.store.v4i32.vi32 +; CHECK-NEXT: Intrinsic has incorrect argument type! +; CHECK-NEXT: void (<4 x float>, i32*, i64, i1, i32, i32)* @llvm.matrix.column.major.store.v4f32.pi32 +; + call void @llvm.matrix.column.major.store.v4i32.vi32(<4 x i32> zeroinitializer, float* %m, i64 2, i1 false, i32 2, i32 2) + call void @llvm.matrix.column.major.store.v4f32.pi32(<4 x float> zeroinitializer, i32* %n, i64 2, i1 false, i32 2, i32 2) ret void } + +define void @column.major_store_non_int_float_type(<4 x float>* %m, <4 x float>* %n, i64 %arg) { +; +; CHECK-NEXT: Intrinsic has incorrect argument type! +; CHECK-NEXT: void (<4 x float*>, <4 x float>*, i64, i1, i32, i32)* @llvm.matrix.column.major.store.v4f32p0.p0v4f32 +; + call void @llvm.matrix.column.major.store.v4f32p0.p0v4f32(<4 x float*> zeroinitializer, <4 x float>* %n, i64 2, i1 false, i32 2, i32 2) + ret void +} + +define <4 x float> @column.major_load_stride_too_small(float* %m, i32 %arg) { +; +; CHECK-NEXT: Stride must be greater or equal than the number of rows! +; CHECK-NEXT: <4 x float> (float*, i64, i1, i32, i32)* @llvm.matrix.column.major.load.v4f32 +; + %result.1 = call <4 x float> @llvm.matrix.column.major.load.v4f32(float* %m, i64 1, i1 false, i32 2, i32 2) + ret <4 x float> %result.1 +} + +define void @column.major_store_stride_too_small(float* %m, i64 %arg) { +; +; CHECK-NEXT: Stride must be greater or equal than the number of rows! 
+; CHECK-NEXT: void (<4 x float>, float*, i64, i1, i32, i32)* @llvm.matrix.column.major.store.v4f32 +; + call void @llvm.matrix.column.major.store.v4f32(<4 x float> zeroinitializer, float* %m, i64 1, i1 false, i32 2, i32 2) + ret void +} + +declare <4 x i32> @llvm.matrix.column.major.load.v4i32(float*, i64, i1, i32, i32) +declare <4 x float> @llvm.matrix.column.major.load.v4f32.pi32(i32*, i64, i1, i32, i32) +declare <4 x float> @llvm.matrix.column.major.load.v4f32(float*, i64, i1, i32, i32) +declare <6 x float> @llvm.matrix.column.major.load.v6f32(float*, i64, i1, i32, i32) + +declare void @llvm.matrix.column.major.store.v4f32(<4 x float>, float*, i64, i1, i32, i32) +declare void @llvm.matrix.column.major.store.v6f32(<6 x float>, float*, i64, i1, i32, i32) +declare void @llvm.matrix.column.major.store.v4i32.vi32(<4 x i32>, float*, i64, i1, i32, i32) +declare void @llvm.matrix.column.major.store.v4f32.pi32(<4 x float>, i32*, i64, i1, i32, i32) +declare void @llvm.matrix.column.major.store.v4f32p0.p0v4f32(<4 x float*>, <4 x float>*, i64, i1, i32, i32) + +declare <4 x i32> @llvm.matrix.transpose.v4i32.v4f32(<4 x float>, i32, i32) +declare <4 x float> @llvm.matrix.transpose.v4f32(<4 x float>, i32, i32) +declare <4 x float> @llvm.matrix.transpose.v4f32.v4i32(<4 x i32>, i32, i32) + +declare <4 x i32> @llvm.matrix.multiply.v4i32.v4f32.v4f32(<4 x float>, <4 x float>, i32, i32, i32) +declare <4 x float> @llvm.matrix.multiply.v4f32.v4i32.v4f32(<4 x i32>, <4 x float>, i32, i32, i32) +declare <4 x float> @llvm.matrix.multiply.v4f32.v4f32.v4i32(<4 x float>, <4 x i32>, i32, i32, i32) +declare <4 x float> @llvm.matrix.multiply.v4f32.v4i32.v4i32(<4 x i32>, <4 x i32>, i32, i32, i32) +declare <4 x float> @llvm.matrix.multiply.v4f32.v4f32.v4f32(<4 x float>, <4 x float>, i32, i32, i32)
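Editor's note: taken together, the test updates above boil down to one interface
change: the pointer operand of the column-major load/store intrinsics is now typed
as a pointer to the element type rather than to the vector type. A hypothetical
before/after sketch (function names illustrative):

::

      ; Before: the pointer operand was a vector pointer, which the lowering
      ; immediately bitcast to an element pointer:
      ;   %l = call <9 x double> @llvm.matrix.column.major.load.v9f64.p0v9f64(<9 x double>* %p, i64 3, i1 false, i32 3, i32 3)

      ; After: the operand is an element pointer, matching how columns are addressed.
      declare <9 x double> @llvm.matrix.column.major.load.v9f64(double*, i64, i1, i32, i32)

      define <9 x double> @load_3x3(double* %p) {
        %l = call <9 x double> @llvm.matrix.column.major.load.v9f64(double* %p, i64 3, i1 false, i32 3, i32 3)
        ret <9 x double> %l
      }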