This is an archive of the discontinued LLVM Phabricator instance.


Differential D99433

[Matrix] Including __builtin_matrix_multiply_add for the matrix type extension.
Needs Revision · Public

Authored by everton.constantino on Mar 26 2021, 12:00 PM.


Details

Reviewers

anemet
rjmccall
rsmith
Bigcheese
fhahn

Summary

This patch creates a new builtin to support matrix multiply-add. Currently, when you write C = A*B + C, you pay the overhead of additional fadds. With this
builtin, the accumulators are loaded with the C matrix during the multiplication, considerably reducing the number of operations.
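
To make the claimed saving concrete, here is a minimal scalar sketch in plain C++ (not the builtin itself, and not code from the patch). The names `OpCount`, `multiplyThenAdd` and `multiplyAddPreloaded` are hypothetical, and the "fma" counter simply counts multiply-accumulate steps, which the lowered IR expresses as llvm.fmuladd. Matrices are assumed to be stored column-major.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts of scalar floating-point steps performed by each variant.
struct OpCount {
  std::size_t mul = 0, add = 0, fma = 0;
  std::size_t total() const { return mul + add + fma; }
};

// Column-major storage: element (r, c) lives at index r + c * rows.
using Matrix = std::vector<float>;

// C = A*B + C as a plain multiply followed by a separate add of C.
Matrix multiplyThenAdd(const Matrix &A, const Matrix &B, const Matrix &C,
                       std::size_t rows, std::size_t depth, std::size_t cols,
                       OpCount &ops) {
  assert(depth > 0 && A.size() == rows * depth && B.size() == depth * cols &&
         C.size() == rows * cols);
  Matrix res(rows * cols);
  for (std::size_t c = 0; c < cols; ++c)
    for (std::size_t r = 0; r < rows; ++r) {
      float acc = A[r] * B[c * depth]; // k = 0 starts the accumulator
      ++ops.mul;
      for (std::size_t k = 1; k < depth; ++k) {
        acc += A[r + k * rows] * B[k + c * depth];
        ++ops.fma;
      }
      res[r + c * rows] = acc + C[r + c * rows]; // the additional add
      ++ops.add;
    }
  return res;
}

// Same result, but the accumulator starts out holding the element of C.
Matrix multiplyAddPreloaded(const Matrix &A, const Matrix &B, const Matrix &C,
                            std::size_t rows, std::size_t depth,
                            std::size_t cols, OpCount &ops) {
  assert(A.size() == rows * depth && B.size() == depth * cols &&
         C.size() == rows * cols);
  Matrix res(rows * cols);
  for (std::size_t c = 0; c < cols; ++c)
    for (std::size_t r = 0; r < rows; ++r) {
      float acc = C[r + c * rows]; // preload the accumulator
      for (std::size_t k = 0; k < depth; ++k) {
        acc += A[r + k * rows] * B[k + c * depth];
        ++ops.fma;
      }
      res[r + c * rows] = acc;
    }
  return res;
}
```

For a result element with inner dimension `depth`, the first variant performs `depth + 1` steps and the second `depth`. The two variants add the terms in a different order, so for general floating-point inputs they may round differently; that is why the discussion below turns to fast-math flags.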

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

everton.constantino created this revision. Mar 26 2021, 12:00 PM

Herald added subscribers: dexonsmith, tschuett, hiraditya. Mar 26 2021, 12:00 PM

everton.constantino requested review of this revision. Mar 26 2021, 12:00 PM

Herald added projects: Restricted Project, Restricted Project. Mar 26 2021, 12:00 PM

Herald added subscribers: llvm-commits, cfe-commits, jdoerfert.

[Drive by] LLVM test missing?

@jdoerfert Which tests do you have in mind? I added one for SEMA and one for CodeGen.

Thanks for putting up the patch!

Do you think it would be possible to get the desired behavior without a new builtin? We should be able to combine the add with the initial multiply for each vector, as long as we have the right fast-math flags. IIUC reassociate should be enough. So perhaps it would be possible to perform this optimization in LowerMatrixIntrinsics directly. The user should then be able to enable the right fast-math flags locally using #pragma clang fp, like below. Clang first needs to be updated to handle those pragmas properly for the matrix types.

#pragma clang fp reassociate(on)
C = A*B + C;
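
For readers who want to try the pragma-based form, the snippet above can be expanded into a complete function roughly as follows. This is only a sketch: `multiplyAccumulate2x2` and `m2x2_t` are hypothetical names, the matrix path is compiled only when Clang reports the extension (it requires -fenable-matrix), and whether the pragma actually affects matrix operations depends on the Clang version. A plain-loop fallback is included so the function also builds with other compilers.

```cpp
#include <cassert>

#ifndef __has_extension
#define __has_extension(x) 0 // compilers without this feature-test macro
#endif

#if __has_extension(matrix_types)
typedef float m2x2_t __attribute__((matrix_type(2, 2)));
#endif

// Computes C = A*B + C for 2x2 column-major float matrices held in arrays.
void multiplyAccumulate2x2(float *a, float *b, float *c) {
#if __has_extension(matrix_types)
// Allow reassociation locally so the add may be folded into the multiply.
#pragma clang fp reassociate(on)
  m2x2_t A = __builtin_matrix_column_major_load(a, 2, 2, 2);
  m2x2_t B = __builtin_matrix_column_major_load(b, 2, 2, 2);
  m2x2_t C = __builtin_matrix_column_major_load(c, 2, 2, 2);
  C = A * B + C;
  __builtin_matrix_column_major_store(C, c, 2);
#else
  // Fallback without the extension: same arithmetic with scalar loops.
  float res[4];
  for (int col = 0; col < 2; ++col)
    for (int row = 0; row < 2; ++row) {
      float acc = c[row + col * 2];
      for (int k = 0; k < 2; ++k)
        acc += a[row + k * 2] * b[k + col * 2];
      res[row + col * 2] = acc;
    }
  for (int i = 0; i < 4; ++i)
    c[i] = res[i];
#endif
}
```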

@fhahn That was my first idea; however, it's not as simple as it looks. I tried moving the adds, but splats make it considerably harder to find a pattern that catches this and fuses the multiplies, especially with bigger matrices. My real wish was to add a new IR instruction to handle matrices, because the MADD is but a simple example of other more interesting optimizations that can be done, like using matrix associative properties to reduce the number of calculations. I found that path too complicated, however, and I opted for a compromise at the moment. I wish to start writing some GEMM micro-kernels with this extension and this builtin was the shortest path.

In D99433#2653528, @everton.constantino wrote:

@jdoerfert Which tests do you have in mind? I added one for SEMA and one for CodeGen.

Tests for everything you placed in llvm/. Your tests are all in clang/.

llvm/include/llvm/IR/MatrixBuilder.h
Line 152: This code is not tested, as far as I can tell. Or is it?

Harbormaster completed remote builds in B95914: Diff 333605. Mar 26 2021, 1:01 PM

In D99433#2653586, @everton.constantino wrote:

@fhahn That was my first idea; however, it's not as simple as it looks. I tried moving the adds, but splats make it considerably harder to find a pattern that catches this and fuses the multiplies, especially with bigger matrices. My real wish was to add a new IR instruction to handle matrices, because the MADD is but a simple example of other more interesting optimizations that can be done, like using matrix associative properties to reduce the number of calculations. I found that path too complicated, however, and I opted for a compromise at the moment. I wish to start writing some GEMM micro-kernels with this extension and this builtin was the shortest path.

Could you elaborate on the splats that make this tricky? Before the matrix lowering, there should be no splats: https://godbolt.org/z/r941xsc6b. I was thinking of detecting the multiply/add before we do the actual lowering, e.g. like it is already done for {load, load} ->multiply->store chains in LowerMatrixMultiplyFused https://github.com/llvm/llvm-project/blob/main/llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp#L1346

Before the matrix lowering, there should be no splats: https://godbolt.org/z/r941xsc6b

It might still be convenient to have a separate multiply-add intrinsic for matrixes, because then we could just replace fadd( @matrix.multiply() , X) before lowering. But I am not sure how scalable this will be (I don't think we want too many intrinsics), so perhaps we could keep track of bundles of instructions to lower together in general. But I don't think we need this for the initial optimization to start with.

FYI I filed https://bugs.llvm.org/show_bug.cgi?id=49738 and https://bugs.llvm.org/show_bug.cgi?id=49739 for improving fast-math handling in the lowering pass and support for pragma fp, in case you are interested.

@everton.constantino out of curiosity, what architecture are you focused on and what matrix sizes? I have a few performance improvements lined up for code-gen. It might be good to sync up to make sure there's no duplicated work.

tellenbach added a subscriber: tellenbach. Mar 29 2021, 2:44 PM

@fhahn When I mentioned the splats I was talking about the IR, not the final code. On the Godbolt links you sent, it's the same that I see. However, take a look at the IR your example generates:

%vec.cast = bitcast [4 x float]* %A to <2 x float>*
%col.load = load <2 x float>, <2 x float>* %vec.cast, align 4
%vec.gep = getelementptr [4 x float], [4 x float]* %A, i64 0, i64 2
%vec.cast2 = bitcast float* %vec.gep to <2 x float>*
%col.load3 = load <2 x float>, <2 x float>* %vec.cast2, align 4
%vec.cast4 = bitcast [4 x float]* %B to <2 x float>*
%col.load5 = load <2 x float>, <2 x float>* %vec.cast4, align 4
%vec.gep6 = getelementptr [4 x float], [4 x float]* %B, i64 0, i64 2
%vec.cast7 = bitcast float* %vec.gep6 to <2 x float>*
%col.load8 = load <2 x float>, <2 x float>* %vec.cast7, align 4
%splat.splat = shufflevector <2 x float> %col.load5, <2 x float> poison, <2 x i32> zeroinitializer
%0 = fmul <2 x float> %col.load, %splat.splat
%splat.splat11 = shufflevector <2 x float> %col.load5, <2 x float> undef, <2 x i32> <i32 1, i32 1>
%1 = call <2 x float> @llvm.fmuladd.v2f32(<2 x float> %col.load3, <2 x float> %splat.splat11, <2 x float> %0)
%splat.splat14 = shufflevector <2 x float> %col.load8, <2 x float> poison, <2 x i32> zeroinitializer
%2 = fmul <2 x float> %col.load, %splat.splat14
%splat.splat17 = shufflevector <2 x float> %col.load8, <2 x float> undef, <2 x i32> <i32 1, i32 1>
%3 = call <2 x float> @llvm.fmuladd.v2f32(<2 x float> %col.load3, <2 x float> %splat.splat17, <2 x float> %2)
%vec.cast18 = bitcast [4 x float]* %C to <2 x float>*
%col.load19 = load <2 x float>, <2 x float>* %vec.cast18, align 4
%vec.gep20 = getelementptr [4 x float], [4 x float]* %C, i64 0, i64 2
%vec.cast21 = bitcast float* %vec.gep20 to <2 x float>*
%col.load22 = load <2 x float>, <2 x float>* %vec.cast21, align 4
%4 = fadd <2 x float> %1, %col.load19
%5 = fadd <2 x float> %3, %col.load22
store <2 x float> %4, <2 x float>* %vec.cast18, align 4
store <2 x float> %5, <2 x float>* %vec.cast21, align 4

I don't see a simple, reliable pattern to match the operands of %4 with %0 for example, and this is what I meant by the splat in the middle. The pragma approach assumes that we're always working with architectures where the better approach is to fuse the fmul and fadds. The problem here is that you have to decide between preloading the accumulator or not. On IBM Power10's MMA this would be pretty far from optimal, for example, because you have specific instructions to load accumulators.

In D99433#2661357, @everton.constantino wrote:

@fhahn When I mentioned the splats I was talking about the IR, not the final code. On the Godbolt links you sent, it's the same that I see. However, take a look at the IR your example generates:

Sorry for not being clearer. I meant the IR *before* LowerMatrixIntrinsics is run (which should be on the right-hand side of the Godbolt view). I'm also posting it below. Unless I am missing something, we should be able to easily match fadd (llvm.matrix.multiply(A, B), C) before the actual lowering of llvm.matrix.multiply. I think we do something similar already for combining load->multiply->store chains: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp#L703 . Basically, try to fuse all multiplies before the 'normal' lowering. Would it be possible to deal with fadd (llvm.matrix.multiply(A, B), C) similarly?

lang-13: warning: argument unused during compilation: '--gcc-toolchain=/opt/compiler-explorer/gcc-snapshot' [-Wunused-command-line-argument]
*** IR Dump Before Lower the matrix intrinsics (lower-matrix-intrinsics) ***
; Function Attrs: nofree nounwind uwtable willreturn mustprogress
define dso_local void @_Z3fooRu11matrix_typeILm2ELm2EfES0_S0_([4 x float]* nocapture nonnull readonly align 4 dereferenceable(16) %0, [4 x float]* nocapture nonnull align 4 dereferenceable(16) %1, [4 x float]* nocapture nonnull readonly align 4 dereferenceable(16) %2) local_unnamed_addr #0 {
  %4 = bitcast [4 x float]* %0 to <4 x float>*
  %5 = load <4 x float>, <4 x float>* %4, align 4, !tbaa !6
  %6 = bitcast [4 x float]* %2 to <4 x float>*
  %7 = load <4 x float>, <4 x float>* %6, align 4, !tbaa !6
  %8 = tail call <4 x float> @llvm.matrix.multiply.v4f32.v4f32.v4f32(<4 x float> %5, <4 x float> %7, i32 2, i32 2, i32 2)
  %9 = bitcast [4 x float]* %1 to <4 x float>*
  %10 = load <4 x float>, <4 x float>* %9, align 4, !tbaa !6
  %11 = fadd <4 x float> %8, %10
  store <4 x float> %11, <4 x float>* %9, align 4, !tbaa !6
  ret void
}
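
As an illustration of the match suggested here, the following toy C++ sketch rewrites fadd(multiply(A, B), C) into a single fused node on a hand-rolled expression tree. It deliberately does not use LLVM's APIs: `Node`, `Op`, `leaf`, `matmul`, `fadd` and `fuseMultiplyAdd` are all hypothetical names invented for this example, and it only stands in for the idea of matching before lowering.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Toy expression tree; a stand-in for IR values, not LLVM's classes.
enum class Op { Leaf, MatMul, FAdd, MatMulAdd };

struct Node {
  Op op;
  std::string name;              // only meaningful for leaves
  std::shared_ptr<Node> a, b, c; // operands; unused ones stay null
};
using NodePtr = std::shared_ptr<Node>;

NodePtr leaf(std::string name) {
  return std::make_shared<Node>(
      Node{Op::Leaf, std::move(name), nullptr, nullptr, nullptr});
}
NodePtr matmul(NodePtr a, NodePtr b) {
  return std::make_shared<Node>(
      Node{Op::MatMul, "", std::move(a), std::move(b), nullptr});
}
NodePtr fadd(NodePtr a, NodePtr b) {
  return std::make_shared<Node>(
      Node{Op::FAdd, "", std::move(a), std::move(b), nullptr});
}

// Rewrites fadd(matmul(A, B), C), in either operand order, into one
// MatMulAdd(A, B, C) node. Any other node is returned unchanged.
NodePtr fuseMultiplyAdd(const NodePtr &n) {
  if (n->op != Op::FAdd)
    return n;
  NodePtr mul = n->a, acc = n->b;
  if (mul->op != Op::MatMul)
    std::swap(mul, acc); // fadd is commutative, so try the other side
  if (mul->op != Op::MatMul)
    return n;
  return std::make_shared<Node>(Node{Op::MatMulAdd, "", mul->a, mul->b, acc});
}
```

A real transform would presumably also need to check that the multiply has no other users and that the fast-math flags permit the rewrite; the sketch omits both.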

@fhahn OK, I see what you mean now. This sounds like a doable path and might be able to cover architectures with specialized matrix multiplication instructions as well.

Just to see if I understand correctly: I can add a matrix_add intrinsic, do a traversal looking for matrix_multiply, and fuse both, changing LowerMatrixMultiplyFused to support pre-loading the accumulator. Is that correct?

In D99433#2661919, @everton.constantino wrote:

@fhahn OK, I see what you mean now. This sounds like a doable path and might be able to cover architectures with specialized matrix multiplication instructions as well.

Just to see if I understand correctly: I can add a matrix_add intrinsic, do a traversal looking for matrix_multiply, and fuse both, changing LowerMatrixMultiplyFused to support pre-loading the accumulator. Is that correct?

Yes, that sounds like a good path forward! I think at the moment, adding a matrix_mul_add intrinsic may be a bit premature, as we can just match & lower directly in place, as we already do in LowerMatrixMultiplyFused. Once we add more and more such transforms, it may really help to have additional intrinsics (or we could just create our own dummy declarations which are just used during the matrix lowering, to avoid adding too many intrinsics). But for now I think we can move along faster without adding a new intrinsic.

In D99433#2662259, @fhahn wrote:

In D99433#2661919, @everton.constantino wrote:

@fhahn OK, I see what you mean now. This sounds like a doable path and might be able to cover architectures with specialized matrix multiplication instructions as well.

Just to see if I understand correctly: I can add a matrix_add intrinsic, do a traversal looking for matrix_multiply, and fuse both, changing LowerMatrixMultiplyFused to support pre-loading the accumulator. Is that correct?

Yes, that sounds like a good path forward! I think at the moment, adding a matrix_mul_add intrinsic may be a bit premature, as we can just match & lower directly in place, as we already do in LowerMatrixMultiplyFused. Once we add more and more such transforms, it may really help to have additional intrinsics (or we could just create our own dummy declarations which are just used during the matrix lowering, to avoid adding too many intrinsics). But for now I think we can move along faster without adding a new intrinsic.

Great, I'll update the patch then. Thanks for the comments!

In D99433#2662275, @everton.constantino wrote:

Great, I'll update the patch then. Thanks for the comments!

Sounds good to me, thanks! Marking as changes requested until then, to remove it from the review queue.

This revision now requires changes to proceed. Apr 19 2021, 8:46 AM

Just FYI, #pragma clang fp support for matrix operations has been added in be2277fbf233 by @effective-light in the meantime.

Revision Contents

Path                                                  Size
clang/docs/MatrixTypes.rst                            17 lines
clang/include/clang/Basic/Builtins.def                1 line
clang/include/clang/Basic/DiagnosticSemaKinds.td      4 lines
clang/include/clang/Sema/Sema.h                       2 lines
clang/lib/CodeGen/CGBuiltin.cpp                       17 lines
clang/lib/Sema/SemaChecking.cpp                       75 lines
clang/test/CodeGen/matrix-type-builtins.c             25 lines
clang/test/Sema/matrix-type-builtins.c                11 lines
llvm/include/llvm/IR/Intrinsics.td                    7 lines
llvm/include/llvm/IR/MatrixBuilder.h                  25 lines
llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp  95 lines

Diff 333605

clang/docs/MatrixTypes.rst

[...]
 [library.description.structure.specifications]/3 in the C++ standard.

 Definitions:

 * M, M1, M2, M3 - Matrix types
 * T - Element type
 * row, col - Row and column arguments respectively.

+``M3 __builtin_matrix_multiply_add(M1 matrixA, M2 matrixB, M3 matrixC)``
+
+Returns: A matrix ``Res`` equivalent to the code below, where ``row`` refers to the
+number of rows of ``M1``, ``depth`` to the number of either columns of ``M1`` or rows of ``M2`` and
+``col`` to the number of columns of ``M2``.
+
+Effects: Equivalent to:
+
+.. code-block:: c++
+
+  M3 Res;
+  for (int C = 0; C < col; ++C) {
+    for (int R = 0; R < row; ++R) {
+      // Preload the accumulator with the matching element of matrixC.
+      T Acc = matrixC[R][C];
+      for (int K = 0; K < depth; ++K)
+        Acc += matrixA[R][K] * matrixB[K][C];
+      Res[R][C] = Acc;
+    }
+  }
+
 ``M2 __builtin_matrix_transpose(M1 matrix)``

 Remarks: The return type is a cv-unqualified matrix type that has the same
 element type as ``M1`` and has the same number of rows as ``M1`` has columns and
 the same number of columns as ``M1`` has rows.

 Returns: A matrix ``Res`` equivalent to the code below, where ``col`` refers to the
[...]

clang/include/clang/Basic/Builtins.def

[...]
 BUILTIN(__builtin_convertvector, "v." , "nct")
 BUILTIN(__builtin_alloca, "v*z" , "Fn")
 BUILTIN(__builtin_alloca_with_align, "v*zIz", "Fn")
 BUILTIN(__builtin_call_with_static_chain, "v.", "nt")

 BUILTIN(__builtin_matrix_transpose, "v.", "nFt")
 BUILTIN(__builtin_matrix_column_major_load, "v.", "nFt")
 BUILTIN(__builtin_matrix_column_major_store, "v.", "nFt")
+BUILTIN(__builtin_matrix_multiply_add, "v.", "nFt")

 // "Overloaded" Atomic operator builtins. These are overloaded to support data
 // types of i8, i16, i32, i64, and i128. The front-end sees calls to the
 // non-suffixed version of these (which has a bogus type) and transforms them to
 // the right overloaded version in Sema (plus casts).

 // FIXME: These assume that char -> i8, short -> i16, int -> i32,
 // long long -> i64.
[...]

clang/include/clang/Basic/DiagnosticSemaKinds.td

[...] def err_matrix_index_outside_range: Error<
   "matrix %select{row|column}0 index is outside the allowed range [0, %1)">;
 def err_matrix_incomplete_index: Error<
   "single subscript expressions are not allowed for matrix values">;
 def err_matrix_separate_incomplete_index: Error<
   "matrix row and column subscripts cannot be separated by any expression">;
 def err_matrix_subscript_comma: Error<
   "comma expressions are not allowed as indices in matrix subscript expressions">;
 def err_builtin_matrix_arg: Error<"1st argument must be a matrix">;
+def err_builtin_matrix_dimension_mismatch: Error<
+  "The number of columns of the 1st argument must be the same as the number of rows of the 2nd argument and the number of rows of the 1st argument and columns of the 2nd argument must match 3rd argument">;
+def err_builtin_matrix_scalar_type: Error<
+  "All arguments elements type must match">;
 def err_builtin_matrix_scalar_unsigned_arg: Error<
   "%0 argument must be a constant unsigned integer expression">;
 def err_builtin_matrix_pointer_arg: Error<
   "%ordinal0 argument must be a pointer to a valid matrix element type">;
 def err_builtin_matrix_pointer_arg_mismatch: Error<
   "the pointee of the 2nd argument must match the element type of the 1st argument (%0 != %1)">;
 def err_builtin_matrix_store_to_const: Error<
   "cannot store matrix to read-only pointer">;
[...]

clang/include/clang/Sema/Sema.h

[...] private:

   // Matrix builtin handling.
   ExprResult SemaBuiltinMatrixTranspose(CallExpr *TheCall,
                                         ExprResult CallResult);
   ExprResult SemaBuiltinMatrixColumnMajorLoad(CallExpr *TheCall,
                                               ExprResult CallResult);
   ExprResult SemaBuiltinMatrixColumnMajorStore(CallExpr *TheCall,
                                                ExprResult CallResult);
+  ExprResult SemaBuiltinMatrixMultiplyAdd(CallExpr *TheCall,
+                                          ExprResult CallResult);

 public:
   enum FormatStringType {
     FST_Scanf,
     FST_Printf,
     FST_NSString,
     FST_Strftime,
     FST_Strfmon,
[...]

clang/lib/CodeGen/CGBuiltin.cpp

[...] case Builtin::BI__builtin_matrix_column_major_store: {
     EmitNonNullArgCheck(RValue::get(Dst.getPointer()), E->getArg(1)->getType(),
                         E->getArg(1)->getExprLoc(), FD, 0);
     Value *Result = MB.CreateColumnMajorStore(
         Matrix, Dst.getPointer(), Align(Dst.getAlignment().getQuantity()),
         Stride, IsVolatile, MatrixTy->getNumRows(), MatrixTy->getNumColumns());
     return RValue::get(Result);
   }

+  case Builtin::BI__builtin_matrix_multiply_add: {
+    MatrixBuilder<CGBuilderTy> MB(Builder);
+    Value *MatrixA = EmitScalarExpr(E->getArg(0));
+    Value *MatrixB = EmitScalarExpr(E->getArg(1));
+    Value *MatrixC = EmitScalarExpr(E->getArg(2));
+
+    const auto *MatrixTy1 =
+        E->getArg(0)->getType()->getAs<ConstantMatrixType>();
+    const auto *MatrixTy2 =
+        E->getArg(1)->getType()->getAs<ConstantMatrixType>();
+
+    Value *Result = MB.CreateMatrixMultiplyAdd(
+        MatrixA, MatrixB, MatrixC, MatrixTy1->getNumRows(),
+        MatrixTy1->getNumColumns(), MatrixTy2->getNumColumns());
+    return RValue::get(Result);
+  }
+
   case Builtin::BIfinite:
   case Builtin::BI__finite:
   case Builtin::BIfinitef:
   case Builtin::BI__finitef:
   case Builtin::BIfinitel:
   case Builtin::BI__finitel:
   case Builtin::BI__builtin_isinf:
   case Builtin::BI__builtin_isfinite: {
[...]

clang/lib/Sema/SemaChecking.cpp

[...] case Builtin::BI__builtin_matrix_transpose:
     return SemaBuiltinMatrixTranspose(TheCall, TheCallResult);

   case Builtin::BI__builtin_matrix_column_major_load:
     return SemaBuiltinMatrixColumnMajorLoad(TheCall, TheCallResult);

   case Builtin::BI__builtin_matrix_column_major_store:
     return SemaBuiltinMatrixColumnMajorStore(TheCall, TheCallResult);

+  case Builtin::BI__builtin_matrix_multiply_add:
+    return SemaBuiltinMatrixMultiplyAdd(TheCall, TheCallResult);
+
   case Builtin::BI__builtin_get_device_side_mangled_name: {
     auto Check = [](CallExpr *TheCall) {
       if (TheCall->getNumArgs() != 1)
         return false;
       auto *DRE = dyn_cast<DeclRefExpr>(TheCall->getArg(0)->IgnoreImpCasts());
       if (!DRE)
         return false;
       auto *D = DRE->getDecl();
[...] ExprResult Sema::SemaBuiltinMatrixColumnMajorLoad(CallExpr *TheCall,
   if (ArgError || !MaybeRows || !MaybeColumns)
     return ExprError();

   TheCall->setType(
       Context.getConstantMatrixType(ElementTy, MaybeRows, MaybeColumns));
   return CallResult;
 }

+ExprResult Sema::SemaBuiltinMatrixMultiplyAdd(CallExpr *TheCall,
+                                              ExprResult CallResult) {
+  if (!getLangOpts().MatrixTypes) {
+    Diag(TheCall->getBeginLoc(), diag::err_builtin_matrix_disabled);
+    return ExprError();
+  }
+
+  if (checkArgCount(*this, TheCall, 3))
+    return ExprError();
+
+  ExprResult MatrixAArg = DefaultLvalueConversion(TheCall->getArg(0));
+  if (MatrixAArg.isInvalid())
+    return MatrixAArg;
+  Expr *MatrixA = MatrixAArg.get();
+
+  auto *MTypeA = MatrixA->getType()->getAs<ConstantMatrixType>();
   Lint: Pre-merge checks: clang-tidy: warning: 'auto *MTypeA' can be declared as 'const auto *MTypeA' [llvm-qualified-auto]
+  if (!MTypeA) {
+    Diag(MatrixA->getBeginLoc(), diag::err_builtin_matrix_arg);
+    return ExprError();
+  }
+
+  ExprResult MatrixBArg = DefaultLvalueConversion(TheCall->getArg(1));
+  if (MatrixBArg.isInvalid())
+    return MatrixBArg;
+  Expr *MatrixB = MatrixBArg.get();
+
+  auto *MTypeB = MatrixB->getType()->getAs<ConstantMatrixType>();
   Lint: Pre-merge checks: clang-tidy: warning: 'auto *MTypeB' can be declared as 'const auto *MTypeB' [llvm-qualified-auto]
+  if (!MTypeB) {
+    Diag(MatrixB->getBeginLoc(), diag::err_builtin_matrix_arg);
+    return ExprError();
+  }
+
+  ExprResult MatrixCArg = DefaultLvalueConversion(TheCall->getArg(2));
+  if (MatrixCArg.isInvalid())
+    return MatrixCArg;
+  Expr *MatrixC = MatrixCArg.get();
+
+  auto *MTypeC = MatrixC->getType()->getAs<ConstantMatrixType>();
   Lint: Pre-merge checks: clang-tidy: warning: 'auto *MTypeC' can be declared as 'const auto *MTypeC' [llvm-qualified-auto]
+  if (!MTypeC) {
+    Diag(MatrixC->getBeginLoc(), diag::err_builtin_matrix_arg);
+    return ExprError();
+  }
+
+  // Check whether all matrices have the same element type. We don't support
+  // mixed precision as of yet.
+  if (!(Context.hasSameType(MTypeC->getElementType(),
+                            MTypeA->getElementType()) &&
+        Context.hasSameType(MTypeC->getElementType(),
+                            MTypeB->getElementType()))) {
+    Diag(MatrixC->getBeginLoc(), diag::err_builtin_matrix_scalar_type);
+    return ExprError();
+  }
+
+  // Check if dimensions are appropriate.
+  if (MTypeA->getNumColumns() != MTypeB->getNumRows() ||
+      !(MTypeC->getNumColumns() == MTypeB->getNumColumns() &&
+        MTypeC->getNumRows() == MTypeA->getNumRows())) {
+    Diag(MatrixC->getBeginLoc(), diag::err_builtin_matrix_dimension_mismatch);
+    return ExprError();
+  }
+
+  // Prepare Result matrix.
+  QualType ResultType = Context.getConstantMatrixType(
+      MTypeC->getElementType(), MTypeC->getNumRows(), MTypeC->getNumColumns());
+
+  TheCall->setType(ResultType);
+  TheCall->setArg(0, MatrixA);
+  TheCall->setArg(1, MatrixB);
+  TheCall->setArg(2, MatrixC);
+  return CallResult;
+}
+
 ExprResult Sema::SemaBuiltinMatrixColumnMajorStore(CallExpr *TheCall,
                                                    ExprResult CallResult) {
   if (checkArgCount(*this, TheCall, 3))
     return ExprError();

   unsigned PtrArgIdx = 1;
   Expr *MatrixExpr = TheCall->getArg(0);
   Expr *PtrExpr = TheCall->getArg(PtrArgIdx);
[...]
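
The element-type and dimension rules checked by SemaBuiltinMatrixMultiplyAdd can be mirrored in ordinary C++ by encoding the shapes in template parameters, so that a mismatch is rejected at compile time. The sketch below is an analogy only, with hypothetical names (`StaticMatrix`, `multiplyAdd`), and is not how the builtin is implemented.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Fixed-size, column-major matrix; the shape is part of the type.
template <typename T, std::size_t Rows, std::size_t Cols>
struct StaticMatrix {
  std::array<T, Rows * Cols> data{};
  T &at(std::size_t r, std::size_t c) { return data[r + c * Rows]; }
  const T &at(std::size_t r, std::size_t c) const { return data[r + c * Rows]; }
};

// A is Rows x Depth, B is Depth x Cols, C is Rows x Cols, all with element
// type T. Any other combination fails template argument deduction, which
// plays the role of the dimension and element-type diagnostics in Sema.
template <typename T, std::size_t Rows, std::size_t Depth, std::size_t Cols>
StaticMatrix<T, Rows, Cols> multiplyAdd(const StaticMatrix<T, Rows, Depth> &A,
                                        const StaticMatrix<T, Depth, Cols> &B,
                                        const StaticMatrix<T, Rows, Cols> &C) {
  StaticMatrix<T, Rows, Cols> res;
  for (std::size_t c = 0; c < Cols; ++c)
    for (std::size_t r = 0; r < Rows; ++r) {
      T acc = C.at(r, c); // accumulator preloaded from C
      for (std::size_t k = 0; k < Depth; ++k)
        acc += A.at(r, k) * B.at(k, c);
      res.at(r, c) = acc;
    }
  return res;
}
```

With this encoding, passing a 5x10 matrix for all three arguments (as the Sema test in this revision does) would not compile, because the inner dimensions of A and B disagree.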

clang/test/CodeGen/matrix-type-builtins.c

	// RUN: %clang_cc1 -fenable-matrix -triple x86_64-apple-darwin %s -emit-llvm -disable-llvm-passes -o - \| FileCheck %s			// RUN: %clang_cc1 -fenable-matrix -triple x86_64-apple-darwin %s -emit-llvm -disable-llvm-passes -o - \| FileCheck %s

	// Also check we do not crash when running some middle-end passes. Most			// Also check we do not crash when running some middle-end passes. Most
	// importantly this includes the IR verifier, to ensure we emit valid IR.			// importantly this includes the IR verifier, to ensure we emit valid IR.
	// RUN: %clang_cc1 -fenable-matrix -emit-llvm -triple x86_64-apple-darwin %s -o %t			// RUN: %clang_cc1 -fenable-matrix -emit-llvm -triple x86_64-apple-darwin %s -o %t

	// Tests for the matrix type builtins.			// Tests for the matrix type builtins.

	typedef double dx5x5_t __attribute__((matrix_type(5, 5)));			typedef double dx5x5_t __attribute__((matrix_type(5, 5)));
	typedef float fx2x3_t __attribute__((matrix_type(2, 3)));			typedef float fx2x3_t __attribute__((matrix_type(2, 3)));
	typedef float fx3x2_t __attribute__((matrix_type(3, 2)));			typedef float fx3x2_t __attribute__((matrix_type(3, 2)));
				typedef float fx2x2_t __attribute__((matrix_type(5, 5)));
	typedef int ix20x4_t __attribute__((matrix_type(20, 4)));			typedef int ix20x4_t __attribute__((matrix_type(20, 4)));
	typedef int ix4x20_t __attribute__((matrix_type(4, 20)));			typedef int ix4x20_t __attribute__((matrix_type(4, 20)));
	typedef unsigned ux1x6_t __attribute__((matrix_type(1, 6)));			typedef unsigned ux1x6_t __attribute__((matrix_type(1, 6)));
	typedef unsigned ux6x1_t __attribute__((matrix_type(6, 1)));			typedef unsigned ux6x1_t __attribute__((matrix_type(6, 1)));

				void multiply_add_2x2(const fx2x2_t a, const fx2x2_t b, fx2x2_t *c) {
				// CHECK-LABEL: define{{...*}} void @multiply_add_2x2(
				// CHECK: [[A_ADDR:%.]] = alloca [25 x float], align 8
				// CHECK-NEXT: [[B_ADDR:%.]] = alloca [25 x float], align 8
				// CHECK-NEXT: [[C_ADDR:%.]] = alloca [25 x float], align 8
				// CHECK-NEXT: store [25 x float]* %a, [25 x float]** [[A_ADDR]], align 8
				// CHECK-NEXT: store [25 x float]* %b, [25 x float]** [[B_ADDR]], align 8
				// CHECK-NEXT: store [25 x float]* %c, [25 x float]** [[C_ADDR]], align 8
				// CHECK-NEXT: [[A_L:%.]] = load [25 x float], [25 x float]** [[A_ADDR]], align 8
				// CHECK-NEXT: [[A_B:%.]] = bitcast [25 x float] [[A_L]] to <25 x float>*
				// CHECK-NEXT: [[A:%.]] = load <25 x float>, <25 x float> [[A_B]], align 4
				// CHECK-NEXT: [[B_L:%.]] = load [25 x float], [25 x float]** [[B_ADDR]], align 8
				// CHECK-NEXT: [[B_B:%.]] = bitcast [25 x float] [[B_L]] to <25 x float>*
				// CHECK-NEXT: [[B:%.]] = load <25 x float>, <25 x float> [[B_B]], align 4
				// CHECK-NEXT: [[C_L:%.]] = load [25 x float], [25 x float]** [[C_ADDR]], align 8
				// CHECK-NEXT: [[C_B:%.]] = bitcast [25 x float] [[C_L]] to <25 x float>*
				// CHECK-NEXT: [[C:%.]] = load <25 x float>, <25 x float> [[C_B]], align 4
				// CHECK-NEXT: [[MADD:%.*]] = call <25 x float> @llvm.matrix.multiply.add.v25f32.v25f32.v25f32.v25f32(<25 x float> [[A]], <25 x float> [[B]], <25 x float> [[C]], i32 5, i32 5, i32 5)
				// CHECK-NEXT: [[CR_L:%.]] = load [25 x float], [25 x float]** [[C_ADDR]], align 8
				// CHECK-NEXT: [[CR_B:%.]] = bitcast [25 x float] [[CR_L]] to <25 x float>*
				// CHECK-NEXT: store <25 x float> [[MADD]], <25 x float>* [[CR_B]], align 4
				c = __builtin_matrix_multiply_add(a, b, c);
				}

	void transpose_double_5x5(dx5x5_t *a) {			void transpose_double_5x5(dx5x5_t *a) {
	// CHECK-LABEL: define{{.*}} void @transpose_double_5x5(			// CHECK-LABEL: define{{.*}} void @transpose_double_5x5(
	// CHECK: [[A:%.*]] = load <25 x double>, <25 x double>* {{.*}}, align 8			// CHECK: [[A:%.*]] = load <25 x double>, <25 x double>* {{.*}}, align 8
	// CHECK-NEXT: [[TRANS:%.*]] = call <25 x double> @llvm.matrix.transpose.v25f64(<25 x double> [[A]], i32 5, i32 5)			// CHECK-NEXT: [[TRANS:%.*]] = call <25 x double> @llvm.matrix.transpose.v25f64(<25 x double> [[A]], i32 5, i32 5)
	// CHECK-NEXT: [[AT_ADDR:%.*]] = bitcast [25 x double]* %a_t to <25 x double>*			// CHECK-NEXT: [[AT_ADDR:%.*]] = bitcast [25 x double]* %a_t to <25 x double>*
	// CHECK-NEXT: store <25 x double> [[TRANS]], <25 x double>* [[AT_ADDR]], align 8			// CHECK-NEXT: store <25 x double> [[TRANS]], <25 x double>* [[AT_ADDR]], align 8
	dx5x5_t a_t = __builtin_matrix_transpose(*a);			dx5x5_t a_t = __builtin_matrix_transpose(*a);
	}			}

clang/test/Sema/matrix-type-builtins.c

void column_major_store(sx5x10_t *m1, ix3x2_t *m2, float *p1, int *p2, struct Foo *p3, const float *p4) {
// expected-error@-1 {{assigning to 'sx5x10_t' (aka 'float __attribute__((matrix_type(5, 10)))') from incompatible type 'void'}}		// expected-error@-1 {{assigning to 'sx5x10_t' (aka 'float __attribute__((matrix_type(5, 10)))') from incompatible type 'void'}}

int x = __builtin_matrix_column_major_store(*m1, p1, 10);		int x = __builtin_matrix_column_major_store(*m1, p1, 10);
// expected-error@-1 {{initializing 'int' with an expression of incompatible type 'void'}}		// expected-error@-1 {{initializing 'int' with an expression of incompatible type 'void'}}

__builtin_matrix_column_major_store(*m1, p4, 20);		__builtin_matrix_column_major_store(*m1, p4, 20);
// expected-error@-1 {{cannot store matrix to read-only pointer}}		// expected-error@-1 {{cannot store matrix to read-only pointer}}
}		}

		void multiply_add(sx5x10_t a, sx5x10_t b, sx5x10_t c, dx3x3 d, dx3x3 e, ix3x3 f) {
		c = __builtin_matrix_multiply_add(a, b, c);
		// expected-error@-1 {{The number of columns of the 1st argument must be the same as the number of rows of the 2nd argument and the number of rows of the 1st argument and columns of the 2nd argument must match 3rd argument}}

		f = __builtin_matrix_multiply_add(d, e, f);
		// expected-error@-1 {{All arguments elements type must match}}

		f = __builtin_matrix_multiply_add(d, e, e);
		// expected-error@-1 {{assigning to 'ix3x3' (aka 'unsigned int __attribute__((matrix_type(3, 3)))') from incompatible type 'double __attribute__((matrix_type(3, 3)))'}}
		}
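
The first expected error in this test encodes the shape rule for `__builtin_matrix_multiply_add(a, b, c)`: the column count of `a` must equal the row count of `b`, and `c` must have the rows of `a` and the columns of `b`. A minimal C++ sketch of that rule follows. The `Shape` struct and the helper names are hypothetical illustrations and are not part of Clang or of this patch.

```cpp
#include <cassert>

// Hypothetical stand-in for a matrix shape; not a Clang type.
struct Shape {
  unsigned Rows;
  unsigned Cols;
};

// Returns true when A * B + C is well-formed:
// A is (Rows x Inner), B is (Inner x Cols), C is (Rows x Cols).
constexpr bool shapesAllowMultiplyAdd(Shape A, Shape B, Shape C) {
  return A.Cols == B.Rows && C.Rows == A.Rows && C.Cols == B.Cols;
}

// Mirrors the two shape situations exercised by the Sema test.
void demoShapeRule() {
  // Three 5x10 operands: 10 columns of A do not match 5 rows of B.
  assert(!shapesAllowMultiplyAdd(Shape{5, 10}, Shape{5, 10}, Shape{5, 10}));
  // Three 3x3 operands satisfy the shape rule (element types are a
  // separate check that this sketch does not model).
  assert(shapesAllowMultiplyAdd(Shape{3, 3}, Shape{3, 3}, Shape{3, 3}));
}
```

The sketch covers only dimensions. The second and third expected errors concern element types, which the patch diagnoses separately.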

llvm/include/llvm/IR/Intrinsics.td


	def int_matrix_multiply			def int_matrix_multiply
	: DefaultAttrsIntrinsic<[llvm_anyvector_ty],			: DefaultAttrsIntrinsic<[llvm_anyvector_ty],
	[llvm_anyvector_ty, llvm_anyvector_ty, llvm_i32_ty, llvm_i32_ty,			[llvm_anyvector_ty, llvm_anyvector_ty, llvm_i32_ty, llvm_i32_ty,
	llvm_i32_ty],			llvm_i32_ty],
	[IntrNoSync, IntrWillReturn, IntrNoMem, IntrSpeculatable, ImmArg<ArgIndex<2>>,			[IntrNoSync, IntrWillReturn, IntrNoMem, IntrSpeculatable, ImmArg<ArgIndex<2>>,
	ImmArg<ArgIndex<3>>, ImmArg<ArgIndex<4>>]>;			ImmArg<ArgIndex<3>>, ImmArg<ArgIndex<4>>]>;

				def int_matrix_multiply_add
				: DefaultAttrsIntrinsic<[llvm_anyvector_ty],
				[llvm_anyvector_ty, llvm_anyvector_ty, llvm_anyvector_ty, llvm_i32_ty, llvm_i32_ty,
				llvm_i32_ty],
				[IntrNoSync, IntrWillReturn, IntrNoMem, IntrSpeculatable, ImmArg<ArgIndex<3>>,
				ImmArg<ArgIndex<4>>, ImmArg<ArgIndex<5>>]>;
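
As a reading aid for the operand list of `int_matrix_multiply_add`, here is a plain C++ sketch of the intended result, A * B + C, on flattened vectors. It assumes the default column-major embedding used by the matrix intrinsics and takes the three trailing integers as LHS rows, LHS columns and RHS columns, matching `CreateMatrixMultiplyAdd` in this patch. The function name `matrixMultiplyAddRef` is hypothetical, and the loop says nothing about how the lowering orders or vectorizes the operations.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference sketch: A is M x N, B is N x K, C and the result are M x K.
// Element (Row, Col) of a matrix with R rows lives at index Col * R + Row.
std::vector<float> matrixMultiplyAddRef(const std::vector<float> &A,
                                        const std::vector<float> &B,
                                        const std::vector<float> &C,
                                        unsigned M, unsigned N, unsigned K) {
  assert(A.size() == static_cast<std::size_t>(M) * N);
  assert(B.size() == static_cast<std::size_t>(N) * K);
  assert(C.size() == static_cast<std::size_t>(M) * K);

  // Start from the accumulator instead of a zero matrix.
  std::vector<float> Result(C);
  for (unsigned Col = 0; Col < K; ++Col)
    for (unsigned Inner = 0; Inner < N; ++Inner)
      for (unsigned Row = 0; Row < M; ++Row)
        Result[Col * M + Row] += A[Inner * M + Row] * B[Col * N + Inner];
  return Result;
}
```

For the 5x5 CodeGen test in this revision, the corresponding call would pass `M = N = K = 5` with three 25-element vectors.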

	def int_matrix_column_major_load			def int_matrix_column_major_load
	: DefaultAttrsIntrinsic<[llvm_anyvector_ty],			: DefaultAttrsIntrinsic<[llvm_anyvector_ty],
	[LLVMPointerToElt<0>, llvm_i64_ty, llvm_i1_ty,			[LLVMPointerToElt<0>, llvm_i64_ty, llvm_i1_ty,
	llvm_i32_ty, llvm_i32_ty],			llvm_i32_ty, llvm_i32_ty],
	[IntrNoSync, IntrWillReturn, IntrArgMemOnly, IntrReadMem,			[IntrNoSync, IntrWillReturn, IntrArgMemOnly, IntrReadMem,
	NoCapture<ArgIndex<0>>, ImmArg<ArgIndex<2>>, ImmArg<ArgIndex<3>>,			NoCapture<ArgIndex<0>>, ImmArg<ArgIndex<2>>, ImmArg<ArgIndex<3>>,
	ImmArg<ArgIndex<4>>]>;			ImmArg<ArgIndex<4>>]>;


llvm/include/llvm/IR/MatrixBuilder.h

CallInst *CreateMatrixTranspose(Value *Matrix, unsigned Rows,
Type *OverloadedTypes[] = {ReturnType};		Type *OverloadedTypes[] = {ReturnType};
Value *Ops[] = {Matrix, B.getInt32(Rows), B.getInt32(Columns)};		Value *Ops[] = {Matrix, B.getInt32(Rows), B.getInt32(Columns)};
Function *TheFn = Intrinsic::getDeclaration(		Function *TheFn = Intrinsic::getDeclaration(
getModule(), Intrinsic::matrix_transpose, OverloadedTypes);		getModule(), Intrinsic::matrix_transpose, OverloadedTypes);

return B.CreateCall(TheFn->getFunctionType(), TheFn, Ops, Name);		return B.CreateCall(TheFn->getFunctionType(), TheFn, Ops, Name);
}		}

		/// Create a llvm.matrix.multiply.add call, multiplying matrixes \p LHS and \p
		/// RHS and adding the result to \p ACC.
		CallInst *CreateMatrixMultiplyAdd(Value *LHS, Value *RHS, Value *ACC,
		unsigned LHSRows, unsigned LHSColumns,
		unsigned RHSColumns,
		const Twine &Name = "") {
		auto *LHSType = cast<VectorType>(LHS->getType());
		auto *RHSType = cast<VectorType>(RHS->getType());
		auto *AccType = cast<VectorType>(ACC->getType());

		auto *ReturnType =
		FixedVectorType::get(LHSType->getElementType(), LHSRows * RHSColumns);
		Value *Ops[] = {LHS,
		RHS,
		ACC,
		B.getInt32(LHSRows),
		B.getInt32(LHSColumns),
		B.getInt32(RHSColumns)};
		Type *OverloadedTypes[] = {ReturnType, LHSType, RHSType, AccType};

		Function *TheFn = Intrinsic::getDeclaration(
		getModule(), Intrinsic::matrix_multiply_add, OverloadedTypes);
		return B.CreateCall(TheFn->getFunctionType(), TheFn, Ops, Name);
		}

		jdoerfert (inline comment): This code is not tested, as far as I can tell. Or is it?
/// Create a llvm.matrix.multiply call, multiplying matrixes \p LHS and \p		/// Create a llvm.matrix.multiply call, multiplying matrixes \p LHS and \p
/// RHS.		/// RHS.
CallInst *CreateMatrixMultiply(Value *LHS, Value *RHS, unsigned LHSRows,		CallInst *CreateMatrixMultiply(Value *LHS, Value *RHS, unsigned LHSRows,
unsigned LHSColumns, unsigned RHSColumns,		unsigned LHSColumns, unsigned RHSColumns,
const Twine &Name = "") {		const Twine &Name = "") {
auto *LHSType = cast<VectorType>(LHS->getType());		auto *LHSType = cast<VectorType>(LHS->getType());
auto *RHSType = cast<VectorType>(RHS->getType());		auto *RHSType = cast<VectorType>(RHS->getType());


llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp

bool supportsShapeInfo(Value *V) {
Instruction *Inst = dyn_cast<Instruction>(V);		Instruction *Inst = dyn_cast<Instruction>(V);
if (!Inst)		if (!Inst)
return false;		return false;

IntrinsicInst *II = dyn_cast<IntrinsicInst>(Inst);		IntrinsicInst *II = dyn_cast<IntrinsicInst>(Inst);
if (II)		if (II)
switch (II->getIntrinsicID()) {		switch (II->getIntrinsicID()) {
case Intrinsic::matrix_multiply:		case Intrinsic::matrix_multiply:
		case Intrinsic::matrix_multiply_add:
case Intrinsic::matrix_transpose:		case Intrinsic::matrix_transpose:
case Intrinsic::matrix_column_major_load:		case Intrinsic::matrix_column_major_load:
case Intrinsic::matrix_column_major_store:		case Intrinsic::matrix_column_major_store:
return true;		return true;
default:		default:
return false;		return false;
}		}
return isUniformShape(V) || isa<StoreInst>(V) || isa<LoadInst>(V);		return isUniformShape(V) || isa<StoreInst>(V) || isa<LoadInst>(V);
propagateShapeForward(SmallVectorImpl<Instruction *> &WorkList) {
while (!WorkList.empty()) {		while (!WorkList.empty()) {
Instruction *Inst = WorkList.pop_back_val();		Instruction *Inst = WorkList.pop_back_val();

// New entry, set the value and insert operands		// New entry, set the value and insert operands
bool Propagate = false;		bool Propagate = false;

Value *MatrixA;		Value *MatrixA;
Value *MatrixB;		Value *MatrixB;
		Value *MatrixC;
Value *M;		Value *M;
Value *N;		Value *N;
Value *K;		Value *K;
if (match(Inst, m_Intrinsic<Intrinsic::matrix_multiply>(		if (match(Inst, m_Intrinsic<Intrinsic::matrix_multiply>(
m_Value(MatrixA), m_Value(MatrixB), m_Value(M),		m_Value(MatrixA), m_Value(MatrixB), m_Value(M),
m_Value(N), m_Value(K)))) {		m_Value(N), m_Value(K)))) {
Propagate = setShapeInfo(Inst, {M, K});		Propagate = setShapeInfo(Inst, {M, K});
		} else if (match(Inst,
		m_Intrinsic<Intrinsic::matrix_multiply_add>(
		m_Value(MatrixA), m_Value(MatrixB), m_Value(MatrixC),
		m_Value(M), m_Value(N), m_Value(K)))) {
		Propagate = setShapeInfo(Inst, {M, K});
} else if (match(Inst, m_Intrinsic<Intrinsic::matrix_transpose>(		} else if (match(Inst, m_Intrinsic<Intrinsic::matrix_transpose>(
m_Value(MatrixA), m_Value(M), m_Value(N)))) {		m_Value(MatrixA), m_Value(M), m_Value(N)))) {
// Flip dimensions.		// Flip dimensions.
Propagate = setShapeInfo(Inst, {N, M});		Propagate = setShapeInfo(Inst, {N, M});
} else if (match(Inst, m_Intrinsic<Intrinsic::matrix_column_major_store>(		} else if (match(Inst, m_Intrinsic<Intrinsic::matrix_column_major_store>(
m_Value(MatrixA), m_Value(), m_Value(),		m_Value(MatrixA), m_Value(), m_Value(),
m_Value(), m_Value(M), m_Value(N)))) {		m_Value(), m_Value(M), m_Value(N)))) {
Propagate = setShapeInfo(Inst, {N, M});		Propagate = setShapeInfo(Inst, {N, M});
while (!WorkList.empty()) {
Value *V = WorkList.pop_back_val();		Value *V = WorkList.pop_back_val();

size_t BeforeProcessingV = WorkList.size();		size_t BeforeProcessingV = WorkList.size();
if (!isa<Instruction>(V))		if (!isa<Instruction>(V))
continue;		continue;

Value *MatrixA;		Value *MatrixA;
Value *MatrixB;		Value *MatrixB;
		Value *MatrixC;
Value *M;		Value *M;
Value *N;		Value *N;
Value *K;		Value *K;
if (match(V, m_Intrinsic<Intrinsic::matrix_multiply>(		if (match(V, m_Intrinsic<Intrinsic::matrix_multiply>(
m_Value(MatrixA), m_Value(MatrixB), m_Value(M),		m_Value(MatrixA), m_Value(MatrixB), m_Value(M),
m_Value(N), m_Value(K)))) {		m_Value(N), m_Value(K)))) {
if (setShapeInfo(MatrixA, {M, N}))		if (setShapeInfo(MatrixA, {M, N}))
pushInstruction(MatrixA, WorkList);		pushInstruction(MatrixA, WorkList);

if (setShapeInfo(MatrixB, {N, K}))		if (setShapeInfo(MatrixB, {N, K}))
pushInstruction(MatrixB, WorkList);		pushInstruction(MatrixB, WorkList);
		} else if (match(V,
		m_Intrinsic<Intrinsic::matrix_multiply_add>(
		m_Value(MatrixA), m_Value(MatrixB), m_Value(MatrixC),
		m_Value(M), m_Value(N), m_Value(K)))) {
		if (setShapeInfo(MatrixA, {M, N}))
		pushInstruction(MatrixA, WorkList);

		if (setShapeInfo(MatrixB, {N, K}))
		pushInstruction(MatrixB, WorkList);

		if (setShapeInfo(MatrixC, {M, K}))
		pushInstruction(MatrixC, WorkList);
} else if (match(V, m_Intrinsic<Intrinsic::matrix_transpose>(		} else if (match(V, m_Intrinsic<Intrinsic::matrix_transpose>(
m_Value(MatrixA), m_Value(M), m_Value(N)))) {		m_Value(MatrixA), m_Value(M), m_Value(N)))) {
// Flip dimensions.		// Flip dimensions.
if (setShapeInfo(MatrixA, {M, N}))		if (setShapeInfo(MatrixA, {M, N}))
pushInstruction(MatrixA, WorkList);		pushInstruction(MatrixA, WorkList);
} else if (match(V, m_Intrinsic<Intrinsic::matrix_column_major_store>(		} else if (match(V, m_Intrinsic<Intrinsic::matrix_column_major_store>(
m_Value(MatrixA), m_Value(), m_Value(), m_Value(),		m_Value(MatrixA), m_Value(), m_Value(), m_Value(),
m_Value(M), m_Value(N)))) {		m_Value(M), m_Value(N)))) {
if (EnableShapePropagation) {
for (BasicBlock &BB : Func)		for (BasicBlock &BB : Func)
for (Instruction &Inst : BB) {		for (Instruction &Inst : BB) {
IntrinsicInst *II = dyn_cast<IntrinsicInst>(&Inst);		IntrinsicInst *II = dyn_cast<IntrinsicInst>(&Inst);
if (!II)		if (!II)
continue;		continue;

switch (II->getIntrinsicID()) {		switch (II->getIntrinsicID()) {
case Intrinsic::matrix_multiply:		case Intrinsic::matrix_multiply:
		case Intrinsic::matrix_multiply_add:
case Intrinsic::matrix_transpose:		case Intrinsic::matrix_transpose:
case Intrinsic::matrix_column_major_load:		case Intrinsic::matrix_column_major_load:
case Intrinsic::matrix_column_major_store:		case Intrinsic::matrix_column_major_store:
WorkList.push_back(&Inst);		WorkList.push_back(&Inst);
break;		break;
default:		default:
break;		break;
}		}
case Intrinsic::matrix_transpose:
LowerTranspose(Inst);		LowerTranspose(Inst);
break;		break;
case Intrinsic::matrix_column_major_load:		case Intrinsic::matrix_column_major_load:
LowerColumnMajorLoad(Inst);		LowerColumnMajorLoad(Inst);
break;		break;
case Intrinsic::matrix_column_major_store:		case Intrinsic::matrix_column_major_store:
LowerColumnMajorStore(Inst);		LowerColumnMajorStore(Inst);
break;		break;
		case Intrinsic::matrix_multiply_add:
		LowerMultiplyAdd(Inst);
		break;
default:		default:
return false;		return false;
}		}
return true;		return true;
}		}

/// Compute the alignment for a column/row \p Idx with \p Stride between them.		/// Compute the alignment for a column/row \p Idx with \p Stride between them.
/// The address at \p Idx == 0 has alignment \p A. If \p Stride is a		/// The address at \p Idx == 0 has alignment \p A. If \p Stride is a
for (Use &U : llvm::make_early_inc_range(Inst->uses())) {
if (ShapeMap.find(U.getUser()) == ShapeMap.end()) {		if (ShapeMap.find(U.getUser()) == ShapeMap.end()) {
if (!Flattened)		if (!Flattened)
Flattened = Matrix.embedInVector(Builder);		Flattened = Matrix.embedInVector(Builder);
U.set(Flattened);		U.set(Flattened);
}		}
}		}
}		}

/// Compute \p Result += \p A * \p B for input matrices with left-associating		/// Compute \p Result += \p A * \p B + \p ACC for input matrices with
/// addition.		/// left-associating addition.
		template <bool isAccumulating = false>
void emitMatrixMultiply(MatrixTy &Result, const MatrixTy &A,		void emitMatrixMultiply(MatrixTy &Result, const MatrixTy &A,
const MatrixTy &B, bool AllowContraction,		const MatrixTy &B, bool AllowContraction,
IRBuilder<> &Builder, bool isTiled) {		IRBuilder<> &Builder, bool isTiled,
		const MatrixTy *ACC = nullptr) {
const unsigned VF = std::max<unsigned>(		const unsigned VF = std::max<unsigned>(
TTI.getRegisterBitWidth(TargetTransformInfo::RGK_FixedWidthVector)		TTI.getRegisterBitWidth(TargetTransformInfo::RGK_FixedWidthVector)
.getFixedSize() /		.getFixedSize() /
Result.getElementType()->getPrimitiveSizeInBits().getFixedSize(),		Result.getElementType()->getPrimitiveSizeInBits().getFixedSize(),
1U);		1U);
unsigned R = Result.getNumRows();		unsigned R = Result.getNumRows();
unsigned C = Result.getNumColumns();		unsigned C = Result.getNumColumns();
unsigned M = A.getNumColumns();		unsigned M = A.getNumColumns();

bool IsFP = Result.getElementType()->isFloatingPointTy();		bool IsFP = Result.getElementType()->isFloatingPointTy();
assert(A.isColumnMajor() == B.isColumnMajor() &&		assert(A.isColumnMajor() == B.isColumnMajor() &&
Result.isColumnMajor() == A.isColumnMajor() &&		Result.isColumnMajor() == A.isColumnMajor() &&
"operands must agree on matrix layout");		"operands must agree on matrix layout");
unsigned NumComputeOps = 0;		unsigned NumComputeOps = 0;
if (A.isColumnMajor()) {		if (A.isColumnMajor()) {
// Multiply columns from the first operand with scalars from the second		// Multiply columns from the first operand with scalars from the second
// operand. Then move along the K axes and accumulate the columns. With		// operand. Then move along the K axes and accumulate the columns. With
// this the adds can be vectorized without reassociation.		// this the adds can be vectorized without reassociation.
for (unsigned J = 0; J < C; ++J) {		for (unsigned J = 0; J < C; ++J) {
unsigned BlockSize = VF;		unsigned BlockSize = VF;
// If Result is zero, we don't need to accumulate in the K==0 iteration.		// If Result is zero, we don't need to accumulate in the K==0 iteration.
bool isSumZero = isa<ConstantAggregateZero>(Result.getColumn(J));		bool isSumZero = isAccumulating
		? false
		: isa<ConstantAggregateZero>(Result.getColumn(J));

for (unsigned I = 0; I < R; I += BlockSize) {		for (unsigned I = 0; I < R; I += BlockSize) {
// Gradually lower the vectorization factor to cover the remainder.		// Gradually lower the vectorization factor to cover the remainder.
while (I + BlockSize > R)		while (I + BlockSize > R)
BlockSize /= 2;		BlockSize /= 2;

Value *Sum = isTiled ? Result.extractVector(I, J, BlockSize, Builder)		Value *Sum =
		isAccumulating ? ACC->extractVector(I, J, BlockSize, Builder)
		: isTiled ? Result.extractVector(I, J, BlockSize, Builder)
: nullptr;		: nullptr;
		;
for (unsigned K = 0; K < M; ++K) {		for (unsigned K = 0; K < M; ++K) {
Value *L = A.extractVector(I, K, BlockSize, Builder);		Value *L = A.extractVector(I, K, BlockSize, Builder);
Value *RH = Builder.CreateExtractElement(B.getColumn(J), K);		Value *RH = Builder.CreateExtractElement(B.getColumn(J), K);
Value *Splat = Builder.CreateVectorSplat(BlockSize, RH, "splat");		Value *Splat = Builder.CreateVectorSplat(BlockSize, RH, "splat");
Sum = createMulAdd(isSumZero && K == 0 ? nullptr : Sum, L, Splat,		Sum = createMulAdd(isSumZero && K == 0 ? nullptr : Sum, L, Splat,
Result.getElementType()->isFloatingPointTy(),		Result.getElementType()->isFloatingPointTy(),
Builder, AllowContraction, NumComputeOps);		Builder, AllowContraction, NumComputeOps);
}		}
Result.setVector(J,		Result.setVector(J,
insertVector(Result.getVector(J), I, Sum, Builder));		insertVector(Result.getVector(J), I, Sum, Builder));
}		}
}		}
} else {		} else {
// Multiply rows from the second operand with scalars from the first		// Multiply rows from the second operand with scalars from the first
// operand. Then move along the K axes and accumulate the rows. With this		// operand. Then move along the K axes and accumulate the rows. With this
// the adds can be vectorized without reassociation.		// the adds can be vectorized without reassociation.
for (unsigned I = 0; I < R; ++I) {		for (unsigned I = 0; I < R; ++I) {
unsigned BlockSize = VF;		unsigned BlockSize = VF;
bool isSumZero = isa<ConstantAggregateZero>(Result.getRow(I));		bool isSumZero = isAccumulating
		? false
		: isa<ConstantAggregateZero>(Result.getRow(I));
for (unsigned J = 0; J < C; J += BlockSize) {		for (unsigned J = 0; J < C; J += BlockSize) {
// Gradually lower the vectorization factor to cover the remainder.		// Gradually lower the vectorization factor to cover the remainder.
while (J + BlockSize > C)		while (J + BlockSize > C)
BlockSize /= 2;		BlockSize /= 2;

Value *Sum = nullptr;		Value *Sum = isAccumulating
		? ACC->extractVector(I, J, BlockSize, Builder)
		: nullptr;
for (unsigned K = 0; K < M; ++K) {		for (unsigned K = 0; K < M; ++K) {
Value *R = B.extractVector(K, J, BlockSize, Builder);		Value *R = B.extractVector(K, J, BlockSize, Builder);
Value *LH = Builder.CreateExtractElement(A.getVector(I), K);		Value *LH = Builder.CreateExtractElement(A.getVector(I), K);
Value *Splat = Builder.CreateVectorSplat(BlockSize, LH, "splat");		Value *Splat = Builder.CreateVectorSplat(BlockSize, LH, "splat");
Sum = createMulAdd(isSumZero && K == 0 ? nullptr : Sum, Splat, R,		Sum = createMulAdd(isSumZero && K == 0 ? nullptr : Sum, Splat, R,
IsFP, Builder, AllowContraction, NumComputeOps);		IsFP, Builder, AllowContraction, NumComputeOps);
}		}
Result.setVector(I,		Result.setVector(I,
if (LoadOp0 && LoadOp1 && Store) {
if (AddrI && (!DT->dominates(AddrI, MatMul)))		if (AddrI && (!DT->dominates(AddrI, MatMul)))
return;		return;

emitSIMDTiling(MatMul, LoadOp0, LoadOp1, Store, FusedInsts);		emitSIMDTiling(MatMul, LoadOp0, LoadOp1, Store, FusedInsts);
return;		return;
}		}
}		}

		/// Lowers llvm.matrix.multiply.add
		void LowerMultiplyAdd(CallInst *MatMulAdd) {
		IRBuilder<> Builder(MatMulAdd);
		auto *EltType = cast<VectorType>(MatMulAdd->getType())->getElementType();
		ShapeInfo LShape(MatMulAdd->getArgOperand(3), MatMulAdd->getArgOperand(4));
		ShapeInfo RShape(MatMulAdd->getArgOperand(4), MatMulAdd->getArgOperand(5));
		ShapeInfo AShape(MatMulAdd->getArgOperand(3), MatMulAdd->getArgOperand(5));

		const MatrixTy &Lhs =
		getMatrix(MatMulAdd->getArgOperand(0), LShape, Builder);
		const MatrixTy &Rhs =
		getMatrix(MatMulAdd->getArgOperand(1), RShape, Builder);
		const MatrixTy &Acc =
		getMatrix(MatMulAdd->getArgOperand(2), AShape, Builder);
		assert(Lhs.getElementType() == Rhs.getElementType() &&
		"Matrix multiply argument element types do not match.");

		const unsigned R = LShape.NumRows;
		const unsigned C = RShape.NumColumns;
		assert(LShape.NumColumns == RShape.NumRows);

		// Initialize the output
		MatrixTy Result(R, C, EltType);
		assert(Lhs.getElementType() == Result.getElementType() &&
		"Matrix multiply result element type does not match arguments.");

		bool AllowContract =
		AllowContractEnabled ||
		(isa<FPMathOperator>(MatMulAdd) && MatMulAdd->hasAllowContract());
		emitMatrixMultiply<true>(Result, Lhs, Rhs, AllowContract, Builder, false,
		&Acc);
		finalizeLowering(MatMulAdd, Result, Builder);
		}
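
To make the motivation for `LowerMultiplyAdd` concrete, the sketch below counts scalar operations per result element for two strategies: a plain multiply followed by a separate add of the accumulator, and a multiply whose running sum is seeded with the accumulator, as the `isAccumulating` path does. It is a simplified model under stated assumptions: it works per scalar element rather than per vector block, it treats the first step of an unseeded sum as a plain multiply, and it treats each later step as one fused multiply-add when contraction is allowed. All names (`OpCount`, `countStep`, `countSeparateAdd`, `countSeededAccumulator`) are hypothetical and do not exist in LLVM.

```cpp
#include <cassert>

// Hypothetical tally of emitted scalar operations.
struct OpCount {
  unsigned Mul = 0;
  unsigned Add = 0;
  unsigned MulAdd = 0;
  unsigned total() const { return Mul + Add + MulAdd; }
};

// One accumulation step: a plain multiply when there is no running sum yet,
// one fused multiply-add when contraction is allowed, otherwise mul + add.
static void countStep(OpCount &Ops, bool HasSum, bool AllowContraction) {
  if (!HasSum) {
    ++Ops.Mul;
    return;
  }
  if (AllowContraction) {
    ++Ops.MulAdd;
    return;
  }
  ++Ops.Mul;
  ++Ops.Add;
}

// C = A * B + C as a multiply followed by a separate element-wise add.
// R x C is the result shape, M is the shared inner dimension.
OpCount countSeparateAdd(unsigned R, unsigned C, unsigned M,
                         bool AllowContraction) {
  OpCount Ops;
  for (unsigned Elt = 0; Elt < R * C; ++Elt) {
    for (unsigned K = 0; K < M; ++K)
      countStep(Ops, /*HasSum=*/K != 0, AllowContraction);
    ++Ops.Add; // the extra add of the accumulator element
  }
  return Ops;
}

// Running sum seeded with the accumulator, so every step has a sum.
OpCount countSeededAccumulator(unsigned R, unsigned C, unsigned M,
                               bool AllowContraction) {
  OpCount Ops;
  for (unsigned Elt = 0; Elt < R * C; ++Elt)
    for (unsigned K = 0; K < M; ++K)
      countStep(Ops, /*HasSum=*/true, AllowContraction);
  return Ops;
}
```

In this model the seeded form saves one operation per result element only when contraction is allowed; without contraction both strategies emit the same number of operations. The two forms also add terms in a different order, which is why folding a separate add into the multiply needs reassociation to be permitted.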

/// Lowers llvm.matrix.multiply.		/// Lowers llvm.matrix.multiply.
void LowerMultiply(CallInst *MatMul) {		void LowerMultiply(CallInst *MatMul) {
IRBuilder<> Builder(MatMul);		IRBuilder<> Builder(MatMul);
auto *EltType = cast<VectorType>(MatMul->getType())->getElementType();		auto *EltType = cast<VectorType>(MatMul->getType())->getElementType();
ShapeInfo LShape(MatMul->getArgOperand(2), MatMul->getArgOperand(3));		ShapeInfo LShape(MatMul->getArgOperand(2), MatMul->getArgOperand(3));
ShapeInfo RShape(MatMul->getArgOperand(3), MatMul->getArgOperand(4));		ShapeInfo RShape(MatMul->getArgOperand(3), MatMul->getArgOperand(4));

const MatrixTy &Lhs = getMatrix(MatMul->getArgOperand(0), LShape, Builder);		const MatrixTy &Lhs = getMatrix(MatMul->getArgOperand(0), LShape, Builder);
void writeFnName(CallInst *CI) {

switch (II->getIntrinsicID()) {		switch (II->getIntrinsicID()) {
case Intrinsic::matrix_multiply:		case Intrinsic::matrix_multiply:
prettyPrintMatrixType(II->getOperand(0), SS);		prettyPrintMatrixType(II->getOperand(0), SS);
SS << ".";		SS << ".";
prettyPrintMatrixType(II->getOperand(1), SS);		prettyPrintMatrixType(II->getOperand(1), SS);
SS << "." << *II->getType()->getScalarType();		SS << "." << *II->getType()->getScalarType();
break;		break;
		case Intrinsic::matrix_multiply_add:
		prettyPrintMatrixType(II->getOperand(0), SS);
		SS << ".";
		prettyPrintMatrixType(II->getOperand(1), SS);
		SS << "." << *II->getType()->getScalarType();
		prettyPrintMatrixType(II->getOperand(2), SS);
		SS << "." << *II->getType()->getScalarType();
		break;
case Intrinsic::matrix_transpose:		case Intrinsic::matrix_transpose:
prettyPrintMatrixType(II->getOperand(0), SS);		prettyPrintMatrixType(II->getOperand(0), SS);
SS << "." << *II->getType()->getScalarType();		SS << "." << *II->getType()->getScalarType();
break;		break;
case Intrinsic::matrix_column_major_load:		case Intrinsic::matrix_column_major_load:
prettyPrintMatrixType(II, SS);		prettyPrintMatrixType(II, SS);
SS << "." << *II->getType()->getScalarType();		SS << "." << *II->getType()->getScalarType();
break;		break;
case Intrinsic::matrix_column_major_store:		case Intrinsic::matrix_column_major_store:
prettyPrintMatrixType(II->getOperand(0), SS);		prettyPrintMatrixType(II->getOperand(0), SS);
SS << "." << *II->getOperand(0)->getType()->getScalarType();		SS << "." << *II->getOperand(0)->getType()->getScalarType();
break;		break;
default:		default:
llvm_unreachable("Unhandled case");		llvm_unreachable("Unhandled case");
}		}
SS.flush();		SS.flush();
write(Tmp);		write(Tmp);
}		}
}		}

unsigned getNumShapeArgs(CallInst *CI) const {		unsigned getNumShapeArgs(CallInst *CI) const {
if (IntrinsicInst *II = dyn_cast<IntrinsicInst>(CI)) {		if (IntrinsicInst *II = dyn_cast<IntrinsicInst>(CI)) {
switch (II->getIntrinsicID()) {		switch (II->getIntrinsicID()) {
case Intrinsic::matrix_multiply:		case Intrinsic::matrix_multiply:
		case Intrinsic::matrix_multiply_add:
return 3;		return 3;
case Intrinsic::matrix_transpose:		case Intrinsic::matrix_transpose:
return 2;		return 2;
case Intrinsic::matrix_column_major_load:		case Intrinsic::matrix_column_major_load:
case Intrinsic::matrix_column_major_store:		case Intrinsic::matrix_column_major_store:
return 3;		return 3;
default:		default:
return 0;		return 0;

Revision Contents

Diff 333605
