This is an archive of the discontinued LLVM Phabricator instance.


Differential D103023

[mlir][gpu] Relax restriction on MMA store op to allow chain of mma ops.
Closed, Public

Authored by ThomasRaoux on May 24 2021, 7:01 AM.

Details

Reviewers

navdeepkk
bondhugula
mravishankar
herhut
nicolasvasilache

Commits

rGb44007bec247: [mlir][gpu] Relax restriction on MMA store op to allow chain of mma ops.

Summary

To allow large matmul operations using the MMA ops, we need to chain operations. This is not possible unless the "DOp" and "COp" types have matching layouts, so this change removes the "DOp" layout and forces the accumulator and result types to match.
Added a test for the case where the MMA value is accumulated.
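
To make the motivation concrete, here is a minimal host-side C++ sketch of chained accumulation. It is not the MLIR or NVVM implementation: `Tile`, `kTile`, `mmaCompute` and `chainedMatmul` are hypothetical names, and a 2x2 float tile stands in for a 16x16 "COp" fragment. It only shows why the result of one step must have the same type as the accumulator of the next step.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an MMA fragment: a tiny square tile of floats.
constexpr std::size_t kTile = 2;
using Tile = std::array<std::array<float, kTile>, kTile>;

// Models one MMA step, C += A * B. The accumulator is taken by value and
// returned, so the result has exactly the accumulator's type.
Tile mmaCompute(const Tile &a, const Tile &b, Tile c) {
  for (std::size_t i = 0; i < kTile; ++i)
    for (std::size_t j = 0; j < kTile; ++j)
      for (std::size_t k = 0; k < kTile; ++k)
        c[i][j] += a[i][k] * b[k][j];
  return c;
}

// Chains one MMA step per pair of tiles along the reduction dimension.
// Each result is fed back in as the next accumulator, which only type-checks
// because the result type and the accumulator type are the same.
Tile chainedMatmul(const std::vector<Tile> &aTiles,
                   const std::vector<Tile> &bTiles) {
  assert(aTiles.size() == bTiles.size() && "need one B tile per A tile");
  Tile acc{}; // zero-initialized accumulator
  for (std::size_t s = 0; s < aTiles.size(); ++s)
    acc = mmaCompute(aTiles[s], bTiles[s], acc);
  return acc;
}
```

With a separate "DOp" result type, the assignment inside the loop would have no direct equivalent, since a "DOp" value could not be passed where a "COp" accumulator is expected.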

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

ThomasRaoux created this revision. May 24 2021, 7:01 AM

Herald added a reviewer: mravishankar. May 24 2021, 7:01 AM

Herald added subscribers: dcaballe, cota, teijeong and 18 others.

ThomasRaoux requested review of this revision. May 24 2021, 7:01 AM

Herald added a reviewer: herhut. May 24 2021, 7:01 AM

Herald added a project: Restricted Project.

Herald added subscribers: stephenneuendorffer, nicolasvasilache.

Note that I was also considering just removing the "DOp" altogether and have it always use "COp" but I didn't know if it was consistent with the direction you have in mind. This would be a good step in the direction of potentially removing those operands altogether.
Let me know what you think.

Harbormaster completed remote builds in B105900: Diff 347380. May 24 2021, 7:43 AM

In D103023#2777141, @ThomasRaoux wrote:

Note that I was also considering just removing the "DOp" altogether and have it always use "COp" but I didn't know if it was consistent with the direction you have in mind. This would be a good step in the direction of potentially removing those operands altogether.
Let me know what you think.

Hi @ThomasRaoux, it seems like DOp is redundant. Everyone (performance-centric) would use the ops in the pattern you have in the test case. So it would be good if we could simply drop DOp and use COp instead.

In D103023#2778914, @navdeepkk wrote:

Hi @ThomasRaoux, it seems like DOp is redundant. Everyone (performance-centric) would use the ops in the pattern you have in the test case. So it would be good if we could simply drop DOp and use COp instead.

Sounds good. Please take another look.

Harbormaster completed remote builds in B106084: Diff 347663. May 25 2021, 7:36 AM

ThomasRaoux added a reviewer: nicolasvasilache. May 27 2021, 7:42 AM

This is a great addition.

mlir/include/mlir/Dialect/GPU/GPUDialect.h
116	nit: extra angle bracket.

This revision is now accepted and ready to land. May 27 2021, 8:53 AM

remove extra > and rebase.

ThomasRaoux marked an inline comment as done. May 27 2021, 9:13 AM

This revision was landed with ongoing or failed builds. May 27 2021, 9:14 AM

Closed by commit rGb44007bec247: [mlir][gpu] Relax restriction on MMA store op to allow chain of mma ops. (authored by ThomasRaoux).

This revision was automatically updated to reflect the committed changes.

ThomasRaoux added a commit: rGb44007bec247: [mlir][gpu] Relax restriction on MMA store op to allow chain of mma ops.

Harbormaster completed remote builds in B106539: Diff 348296. May 27 2021, 9:50 AM

Revision Contents

Path	Size
mlir/include/mlir/Dialect/GPU/GPUDialect.h	19 lines
mlir/include/mlir/Dialect/GPU/GPUOps.td	13 lines
mlir/lib/Conversion/GPUToNVVM/LowerGpuOpsToNVVMOps.cpp	2 lines
mlir/lib/Conversion/GPUToNVVM/WmmaOpsToNvvm.cpp	5 lines
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp	8 lines
mlir/test/Conversion/GPUToNVVM/wmma-ops-to-nvvm.mlir	78 lines
mlir/test/Dialect/GPU/invalid.mlir	26 lines
mlir/test/Integration/GPU/CUDA/TensorCore/wmma-matmul-f16.mlir	4 lines
mlir/test/Integration/GPU/CUDA/TensorCore/wmma-matmul-f32.mlir	4 lines

Diff 348302

mlir/include/mlir/Dialect/GPU/GPUDialect.h

[79 lines not shown]	struct MMAMatrixStorageType : public TypeStorage {

/// Number of dimensions in the MMA matrix.		/// Number of dimensions in the MMA matrix.
unsigned numDims;		unsigned numDims;

/// Element type of elements held in the MMA matrix.		/// Element type of elements held in the MMA matrix.
Type elementType;		Type elementType;

/// MMA operand that this MMAMatrix holds. The general form of operation this		/// MMA operand that this MMAMatrix holds. The general form of operation this
/// type supports is given by the equation D = (alpha*(A*B)) + (beta*C). This		/// type supports is given by the equation C += A*B. This field specifies
/// field specifies which operand in the given equation is held by this type.		/// which operand in the given equation is held by this type. The valid values
/// The valid values are "AOp", "BOp", "COp" and "DOp".		/// are "AOp", "BOp" and "COp".
StringRef operand;		StringRef operand;
};		};

/// MMAMatrix represents a matrix held by a subgroup for matrix-matrix multiply		/// MMAMatrix represents a matrix held by a subgroup for matrix-matrix multiply
/// accumulate operations. MMAMatrices are taken as direct operands by these		/// accumulate operations. MMAMatrices are taken as direct operands by these
/// operations and are also produced as results. These matrices are meant to		/// operations and are also produced as results. These matrices are meant to
/// reside in the registers. A limited number of pointwise operations can be		/// reside in the registers. A limited number of pointwise operations can be
/// performed on these matrices, i.e., operations which operate uniformly on		/// performed on these matrices, i.e., operations which operate uniformly on
/// all the elements in the matrix and do not change the order of matrix		/// all the elements in the matrix and do not change the order of matrix
/// elements. The above conditions exist because the layout of matrix elements		/// elements. The above conditions exist because the layout of matrix elements
/// inside the matrix is opaque i.e., the elements may be present in the		/// inside the matrix is opaque i.e., the elements may be present in the
/// matrix in any order. The general usage of this type is shown as follows:-		/// matrix in any order. The general usage of this type is shown as follows:-
///		///
/// %0 = gpu.subgroup_mma_load_matrix %arg0[%c0, %c0] {leadDimension = 16 :		/// %0 = gpu.subgroup_mma_load_matrix %arg0[%c0, %c0] {leadDimension = 16 :
/// index} : memref<16x16xf16> -> !gpu.mma_matrix<16x16xf16, "AOp">		/// index} : memref<16x16xf16> -> !gpu.mma_matrix<16x16xf16, "AOp">
///		///
/// The MMAMatrixType describes the shape of the matrix being loaded and the		/// The MMAMatrixType describes the shape of the matrix being loaded and the
/// operand being loaded too. The operand needs to be specified to aid the		/// operand being loaded too. The operand needs to be specified to aid the
/// lowering of this type to dialects such as NVVM where each workitem may		/// lowering of this type to dialects such as NVVM where each workitem may
/// hold different amount of elements depending on the elementType of the		/// hold different amount of elements depending on the elementType of the
/// matrix. For e.g., Each workitem holds 4 vector<2xf16>s for f16 data type		/// matrix. For e.g., Each workitem holds 4 vector<2xf16>s for f16 data type
/// and 8 f32s for f32 data type of MMAMatrix. Some other instances of usage		/// and 8 f32s for f32 data type of MMAMatrix. Some other instances of usage
/// are:-		/// are:-
///		///
/// %3 = gpu.subgroup_mma_compute %0, %1, %2 : !gpu.mma_matrix<16x16xf16,		/// %3 = gpu.subgroup_mma_compute %0, %1, %2 :
/// "AOp">, !gpu.mma_matrix<16x16xf16, "BOp">, !gpu.mma_matrix<16x16xf32,		/// !gpu.mma_matrix<16x16xf16, "AOp">, !gpu.mma_matrix<16x16xf16, "BOp">
		navdeepkk: nit: extra angle bracket.
/// "COp"> -> !gpu.mma_matrix<16x16xf32, "DOp">		/// -> !gpu.mma_matrix<16x16xf32, "COp">
///		///
///		///
/// gpu.subgroup_mma_store_matrix %3, %arg22[%c0, %c0] {leadDimension = 16		/// gpu.subgroup_mma_store_matrix %3, %arg22[%c0, %c0] {leadDimension = 16
/// : index}: !gpu.mma_matrix<16x16xf32, "DOp">, memref<16x16xf32>		/// : index}: !gpu.mma_matrix<16x16xf32, "COp">, memref<16x16xf32>
// TODO: consider moving this to ODS.		// TODO: consider moving this to ODS.
class MMAMatrixType		class MMAMatrixType
: public Type::TypeBase<MMAMatrixType, Type, MMAMatrixStorageType> {		: public Type::TypeBase<MMAMatrixType, Type, MMAMatrixStorageType> {
public:		public:
using Base::Base;		using Base::Base;

/// Get MMAMatrixType and verify construction Invariants.		/// Get MMAMatrixType and verify construction Invariants.
static MMAMatrixType get(ArrayRef<int64_t> shape, Type elementType,		static MMAMatrixType get(ArrayRef<int64_t> shape, Type elementType,
[19 lines not shown]	public:

/// Get shape of the matrix.		/// Get shape of the matrix.
ArrayRef<int64_t> getShape() const;		ArrayRef<int64_t> getShape() const;

/// Get elementType of a single element.		/// Get elementType of a single element.
Type getElementType() const;		Type getElementType() const;

/// The general form of operation this type supports is given by the equation		/// The general form of operation this type supports is given by the equation
/// D = (alpha*(A*B)) + (beta*C). This function returns which operand in the		/// C += A*B. This function returns which operand in the given equation is
/// given equation is held by this type. String returned can be one of"AOp",		/// held by this type. String returned can be one of"AOp", "BOp" and "COp".
/// "BOp", "COp" and "DOp".
StringRef getOperand() const;		StringRef getOperand() const;
};		};

// Adds a `gpu.async.token` to the front of the argument list.		// Adds a `gpu.async.token` to the front of the argument list.
void addAsyncDependency(Operation *op, Value token);		void addAsyncDependency(Operation *op, Value token);

} // end namespace gpu		} // end namespace gpu
} // end namespace mlir		} // end namespace mlir
[9 lines not shown]

mlir/include/mlir/Dialect/GPU/GPUOps.td

[960 lines not shown]	let description = [{

This op is meant to be used along with `gpu.subgroup_mma_load_matrix` and		This op is meant to be used along with `gpu.subgroup_mma_load_matrix` and
`gpu.subgroup_mma_compute`.		`gpu.subgroup_mma_compute`.

Example:		Example:

```mlir		```mlir
gpu.subgroup_mma_store_matrix %D, %sg[%i,%j] : { leadDimension = 32 : i32} :		gpu.subgroup_mma_store_matrix %D, %sg[%i,%j] : { leadDimension = 32 : i32} :
!gpu.mma_matrix<16x16xf16, "DOp">, memref<32x32xf16, 3>		!gpu.mma_matrix<16x16xf16, "COp">, memref<32x32xf16, 3>
```		```
}];		}];

let arguments = (ins Arg<MMAMatrixOf<[F16, F32]>>:$src,		let arguments = (ins Arg<MMAMatrixOf<[F16, F32]>>:$src,
Arg<MemRefRankOf<[F16, F32], [2]>, "",[MemWrite]>:$dstMemref,		Arg<MemRefRankOf<[F16, F32], [2]>, "",[MemWrite]>:$dstMemref,
Variadic<Index>:$indices,		Variadic<Index>:$indices,
IndexAttr:$leadDimension);		IndexAttr:$leadDimension);

let assemblyFormat = [{		let assemblyFormat = [{
$src`,` $dstMemref`[`$indices`]` attr-dict `:` type($src)`,` type($dstMemref)		$src`,` $dstMemref`[`$indices`]` attr-dict `:` type($src)`,` type($dstMemref)
}];		}];

let verifier = [{ return ::verify(*this); }];		let verifier = [{ return ::verify(*this); }];
}		}

def GPU_SubgroupMmaComputeOp : GPU_Op<"subgroup_mma_compute", []>{		def GPU_SubgroupMmaComputeOp : GPU_Op<"subgroup_mma_compute",
		[NoSideEffect, AllTypesMatch<["opC", "res"]>]>{

let summary = "GPU warp synchronous matrix multiply accumulate";		let summary = "GPU warp synchronous matrix multiply accumulate";

let description = [{		let description = [{
The `gpu.subgroup_mma_compute` operation performs a matrix-multiply accumulate(mma)		The `gpu.subgroup_mma_compute` operation performs a matrix-multiply accumulate(mma)
operation using all the threads in a subgroup.		operation using all the threads in a subgroup.

This operation takes three `!gpu.mma_matrix`s as arguments. All of them hold `A`,		This operation takes three `!gpu.mma_matrix`s as arguments. All of them hold `A`,
`B` and `C`operands for the mma operation. The operation performed is represented		`B` and `C`operands for the mma operation. The operation performed is represented
as `D = A * B + C`. The op returns a `!gpu.mma_matrix` which contains the result of		as `C += A * B`. The op returns a `!gpu.mma_matrix` which contains the result of
the operation held by the current thread.		the operation held by the current thread.

This op is meant to be used along with `gpu.subgroup_mma_store_matrix` and		This op is meant to be used along with `gpu.subgroup_mma_store_matrix` and
`gpu.subgroup_mma_load_matrix`.		`gpu.subgroup_mma_load_matrix`.

Example:		Example:

```mlir		```mlir
%D = gpu.subgroup_mma_compute_matrix %A, %B, %C :		%D = gpu.subgroup_mma_compute_matrix %A, %B, %C :
!gpu.mma_matrix<16x16xf16, "AOp">, !gpu.mma_matrix<16x16xf16, "BOp">,		!gpu.mma_matrix<16x16xf16, "AOp">, !gpu.mma_matrix<16x16xf16, "BOp">>
!gpu.mma_matrix<16x16xf16, "COp"> -> !gpu.mma_matrix<16x16xf16, "DOp">		-> !gpu.mma_matrix<16x16xf16, "COp">
```		```
}];		}];

let arguments = (ins Arg<MMAMatrixOf<[F16]>>:$opA,		let arguments = (ins Arg<MMAMatrixOf<[F16]>>:$opA,
Arg<MMAMatrixOf<[F16]>>:$opB,		Arg<MMAMatrixOf<[F16]>>:$opB,
Arg<MMAMatrixOf<[F16, F32]>>:$opC);		Arg<MMAMatrixOf<[F16, F32]>>:$opC);

let results = (outs GPU_MMAMatrix:$res);		let results = (outs GPU_MMAMatrix:$res);

let assemblyFormat = [{		let assemblyFormat = [{
$opA`,` $opB`,` $opC attr-dict `:` type($opA)`,` type($opB)`,` type($opC) `->` type($res)		$opA`,` $opB`,` $opC attr-dict `:` type($opA)`,` type($opB) `->` type($res)
}];		}];

let verifier = [{ return ::verify(*this); }];		let verifier = [{ return ::verify(*this); }];
}		}

#endif // GPU_OPS		#endif // GPU_OPS
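
The effect of the new `AllTypesMatch<["opC", "res"]>` trait can be sketched outside of MLIR as a plain type-equality check. The snippet below is a hedged model, not the ODS-generated verifier: `MmaMatrixType`, `sameType`, `inferResultType` and `canChain` are hypothetical names introduced only for illustration.

```cpp
#include <string>

// Hypothetical plain-data model of !gpu.mma_matrix<RxCxElem, "Operand">.
struct MmaMatrixType {
  int rows;
  int cols;
  std::string elementType; // e.g. "f16" or "f32"
  std::string operand;     // "AOp", "BOp" or "COp"
};

// Two modelled types are equal when every field is equal.
bool sameType(const MmaMatrixType &a, const MmaMatrixType &b) {
  return a.rows == b.rows && a.cols == b.cols &&
         a.elementType == b.elementType && a.operand == b.operand;
}

// Models the constraint that the result type equals the type of opC:
// the result type is simply the accumulator type.
MmaMatrixType inferResultType(const MmaMatrixType &opC) { return opC; }

// A result can be used as the accumulator of a following compute op only if
// its type is the one that op expects for opC.
bool canChain(const MmaMatrixType &previousResult,
              const MmaMatrixType &nextOpC) {
  return sameType(previousResult, nextOpC);
}
```

Under this model, a result typed as `"DOp"` (the left-hand side of the diff) never satisfies `canChain` against a `"COp"` accumulator, while a result produced by `inferResultType` always does.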

mlir/lib/Conversion/GPUToNVVM/LowerGpuOpsToNVVMOps.cpp

[129 lines not shown]	void runOnOperation() override {
converter.addConversion([&](gpu::MMAMatrixType type) -> Type {		converter.addConversion([&](gpu::MMAMatrixType type) -> Type {
// The number of items in structToReturn are dependent on the the dataType		// The number of items in structToReturn are dependent on the the dataType
// and the MMA operand that this operation is associated with.		// and the MMA operand that this operation is associated with.
llvm::DenseMap<StringRef, int64_t> numElemsPerThreadF16,		llvm::DenseMap<StringRef, int64_t> numElemsPerThreadF16,
numElemsPerThreadF32;		numElemsPerThreadF32;
numElemsPerThreadF16["AOp"] = 8;		numElemsPerThreadF16["AOp"] = 8;
numElemsPerThreadF16["BOp"] = 8;		numElemsPerThreadF16["BOp"] = 8;
numElemsPerThreadF16["COp"] = 4;		numElemsPerThreadF16["COp"] = 4;
numElemsPerThreadF16["DOp"] = 4;
numElemsPerThreadF32["AOp"] = 8;		numElemsPerThreadF32["AOp"] = 8;
numElemsPerThreadF32["BOp"] = 8;		numElemsPerThreadF32["BOp"] = 8;
numElemsPerThreadF32["COp"] = 8;		numElemsPerThreadF32["COp"] = 8;
numElemsPerThreadF32["DOp"] = 8;
Type structToReturn;		Type structToReturn;
if (type.getElementType().isF16()) {		if (type.getElementType().isF16()) {
// Number of f16's in 32-bit.		// Number of f16's in 32-bit.
unsigned vecSize = 2;		unsigned vecSize = 2;
Type vec = VectorType::get(vecSize, FloatType::getF16(&getContext()));		Type vec = VectorType::get(vecSize, FloatType::getF16(&getContext()));
unsigned size = numElemsPerThreadF16[type.getOperand()];		unsigned size = numElemsPerThreadF16[type.getOperand()];
SmallVector<Type> elements(size, vec);		SmallVector<Type> elements(size, vec);
structToReturn =		structToReturn =
[106 lines not shown]
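
The two lookup tables on the new side of this hunk can be summarized by a small standalone function. This is only a sketch that mirrors the table values shown above. It does not use `llvm::DenseMap` or any MLIR types, and `ElemKind`, `isValidOperand` and `numStructMembersPerThread` are hypothetical names.

```cpp
#include <cassert>
#include <string>

// Hypothetical tag for the element type of the MMA matrix.
enum class ElemKind { F16, F32 };

// After this change only three operand names are accepted.
bool isValidOperand(const std::string &operand) {
  return operand == "AOp" || operand == "BOp" || operand == "COp";
}

// Returns the table entry for the given element kind and operand, using the
// values from the new side of the diff: 8 for "AOp" and "BOp" in both tables,
// 4 for an f16 "COp" and 8 for an f32 "COp".
int numStructMembersPerThread(ElemKind kind, const std::string &operand) {
  assert(isValidOperand(operand) && "operand must be AOp, BOp or COp");
  if (operand == "COp")
    return kind == ElemKind::F16 ? 4 : 8;
  return 8;
}
```

Because the removed `"DOp"` entries held the same values as the `"COp"` entries (4 for f16, 8 for f32), dropping them does not change any fragment size.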

mlir/lib/Conversion/GPUToNVVM/WmmaOpsToNvvm.cpp

[23 lines not shown]
/// GPU subgroup ops to NVVM dialect.		/// GPU subgroup ops to NVVM dialect.
struct CommonLLVMAndBuiltInMLIRTypes {		struct CommonLLVMAndBuiltInMLIRTypes {
public:		public:
CommonLLVMAndBuiltInMLIRTypes(MLIRContext *context) {		CommonLLVMAndBuiltInMLIRTypes(MLIRContext *context) {
numHalfsInOpFrags.resize(4);		numHalfsInOpFrags.resize(4);
numHalfsInOpFrags[A] = 8;		numHalfsInOpFrags[A] = 8;
numHalfsInOpFrags[B] = 8;		numHalfsInOpFrags[B] = 8;
numHalfsInOpFrags[C] = 4;		numHalfsInOpFrags[C] = 4;
numHalfsInOpFrags[D] = 4;
i32Ty = IntegerType::get(context, 32);		i32Ty = IntegerType::get(context, 32);
f16Ty = FloatType::getF16(context);		f16Ty = FloatType::getF16(context);
f32Ty = FloatType::getF32(context);		f32Ty = FloatType::getF32(context);
f16x2Ty = VectorType::get(2, f16Ty);		f16x2Ty = VectorType::get(2, f16Ty);
fragArrayABTy = LLVM::LLVMStructType::getLiteral(		fragArrayABTy = LLVM::LLVMStructType::getLiteral(
context, SmallVector<Type>(8, f16x2Ty));		context, SmallVector<Type>(8, f16x2Ty));
fragArrayCDTy = LLVM::LLVMStructType::getLiteral(		fragArrayCDTy = LLVM::LLVMStructType::getLiteral(
context, SmallVector<Type>(4, f16x2Ty));		context, SmallVector<Type>(4, f16x2Ty));
[17 lines not shown]	public:
/// fp32 data type in a WMMA operation of the form D = (alpha*(A*B)) +		/// fp32 data type in a WMMA operation of the form D = (alpha*(A*B)) +
/// (beta*C).		/// (beta*C).
Type fragArrayCDF32Ty;		Type fragArrayCDF32Ty;
/// Represents the number of f16 elements a single thread holds in a WMMA		/// Represents the number of f16 elements a single thread holds in a WMMA
/// operation of the form D = (alpha*(A*B)) + (beta*C) .		/// operation of the form D = (alpha*(A*B)) + (beta*C) .
SmallVector<unsigned, 4> numHalfsInOpFrags;		SmallVector<unsigned, 4> numHalfsInOpFrags;
/// Represents the operands of a MMA operation of the form D = (alpha*(A*B)) +		/// Represents the operands of a MMA operation of the form D = (alpha*(A*B)) +
/// (beta*C).		/// (beta*C).
enum OperandMap { A, B, C, D };		enum OperandMap { A, B, C };
};		};

/// Checks if all the operands of the op being lowered are of LLVM Types. The		/// Checks if all the operands of the op being lowered are of LLVM Types. The
/// types are expected to be converted by the `LLVMTypeConverter` before the op		/// types are expected to be converted by the `LLVMTypeConverter` before the op
/// is actually lowered. If the type of an operands is not already converted it		/// is actually lowered. If the type of an operands is not already converted it
/// hints a missing typeConversion and failure is returned in that case.		/// hints a missing typeConversion and failure is returned in that case.
static LogicalResult areAllLLVMTypes(Operation *op, ValueRange operands,		static LogicalResult areAllLLVMTypes(Operation *op, ValueRange operands,
ConversionPatternRewriter &rewriter) {		ConversionPatternRewriter &rewriter) {
[225 lines not shown]	gpu::MMAMatrixType srcType =
subgroupMmaStoreMatrixOp.src().getType().cast<gpu::MMAMatrixType>();		subgroupMmaStoreMatrixOp.src().getType().cast<gpu::MMAMatrixType>();
ArrayRef<int64_t> srcTypeShape = srcType.getShape();		ArrayRef<int64_t> srcTypeShape = srcType.getShape();

// Unpack the results from the source.		// Unpack the results from the source.
if (subgroupMmaStoreMatrixOp.src()		if (subgroupMmaStoreMatrixOp.src()
.getType()		.getType()
.cast<gpu::MMAMatrixType>()		.cast<gpu::MMAMatrixType>()
.getElementType() == f16Ty) {		.getElementType() == f16Ty) {
for (unsigned i = 0, e = numHalfsInOpFrags[D]; i < e; ++i) {		for (unsigned i = 0, e = numHalfsInOpFrags[C]; i < e; ++i) {
Value toUse = rewriter.create<LLVM::ExtractValueOp>(		Value toUse = rewriter.create<LLVM::ExtractValueOp>(
loc, f16x2Ty, operands[0], rewriter.getI32ArrayAttr(i));		loc, f16x2Ty, operands[0], rewriter.getI32ArrayAttr(i));
storeOpOperands.push_back(toUse);		storeOpOperands.push_back(toUse);
}		}
storeOpOperands.push_back(leadingDim32);		storeOpOperands.push_back(leadingDim32);

// Create nvvm.mma_store op.		// Create nvvm.mma_store op.
if (srcTypeShape[0] == 16 && srcTypeShape[1] == 16) {		if (srcTypeShape[0] == 16 && srcTypeShape[1] == 16) {
[135 lines not shown]

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp

[58 lines not shown]	bool MMAMatrixType::isValidElementType(Type elementType) {
return elementType.isF16() || elementType.isF32();		return elementType.isF16() || elementType.isF32();
}		}

LogicalResult		LogicalResult
MMAMatrixType::verify(function_ref<InFlightDiagnostic()> emitError,		MMAMatrixType::verify(function_ref<InFlightDiagnostic()> emitError,
ArrayRef<int64_t> shape, Type elementType,		ArrayRef<int64_t> shape, Type elementType,
StringRef operand) {		StringRef operand) {
if (!operand.equals("AOp") && !operand.equals("BOp") &&		if (!operand.equals("AOp") && !operand.equals("BOp") &&
!operand.equals("COp") && !operand.equals("DOp"))		!operand.equals("COp"))
return emitError() << "operand expected to be one of AOp, BOp, COp or DOp";		return emitError() << "operand expected to be one of AOp, BOp or COp";

if (shape.size() != 2)		if (shape.size() != 2)
return emitError() << "MMAMatrixType must have exactly two dimensions";		return emitError() << "MMAMatrixType must have exactly two dimensions";

if (!MMAMatrixType::isValidElementType(elementType))		if (!MMAMatrixType::isValidElementType(elementType))
return emitError() << "MMAMatrixType elements must be F16 or F32";		return emitError() << "MMAMatrixType elements must be F16 or F32";

return success();		return success();
[945 lines not shown]	if (!dstMemrefType.getAffineMaps().empty() &&
return op.emitError("expected identity layout map for destination memref");		return op.emitError("expected identity layout map for destination memref");

if (dstMemSpace != kGenericMemorySpace && dstMemSpace != kSharedMemorySpace &&		if (dstMemSpace != kGenericMemorySpace && dstMemSpace != kSharedMemorySpace &&
dstMemSpace != kGlobalMemorySpace)		dstMemSpace != kGlobalMemorySpace)
return op.emitError(		return op.emitError(
"destination memorySpace of kGenericMemorySpace, "		"destination memorySpace of kGenericMemorySpace, "
"kGlobalMemorySpace or kSharedMemorySpace only allowed");		"kGlobalMemorySpace or kSharedMemorySpace only allowed");

if (!srcMatrixType.getOperand().equals("DOp"))		if (!srcMatrixType.getOperand().equals("COp"))
return op.emitError(		return op.emitError(
"expected the operand matrix being stored to have 'DOp' operand type");		"expected the operand matrix being stored to have 'COp' operand type");

return success();		return success();
}		}

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// GPU_SubgroupMmaComputeOp		// GPU_SubgroupMmaComputeOp
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

[32 lines not shown]
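
The two operand checks changed in this file can be modelled with plain strings. The sketch below reuses the diagnostic text from the new side of the diff, but it is not the MLIR verifier. `verifyOperandName` and `verifyStoreSource` are hypothetical helpers that return an empty string on success instead of a `LogicalResult`.

```cpp
#include <string>

// Models the operand check in MMAMatrixType::verify after this change:
// "DOp" is no longer an accepted operand name.
// Returns an empty string on success, otherwise the diagnostic text.
std::string verifyOperandName(const std::string &operand) {
  if (operand != "AOp" && operand != "BOp" && operand != "COp")
    return "operand expected to be one of AOp, BOp or COp";
  return "";
}

// Models the operand check in the store op verifier: the matrix being stored
// must now be a "COp" matrix, i.e. the type the compute op returns.
std::string verifyStoreSource(const std::string &operand) {
  std::string err = verifyOperandName(operand);
  if (!err.empty())
    return err;
  if (operand != "COp")
    return "expected the operand matrix being stored to have 'COp' operand "
           "type";
  return "";
}
```

In this model a `"DOp"` source is already rejected by the type-level check, so the store-specific diagnostic is only reached for `"AOp"` and `"BOp"` sources.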

mlir/test/Conversion/GPUToNVVM/wmma-ops-to-nvvm.mlir

[25 lines not shown]
}		}

// -----		// -----

gpu.module @test_module {		gpu.module @test_module {

// CHECK-LABEL: func @gpu_wmma_store_op		// CHECK-LABEL: func @gpu_wmma_store_op
// CHECK-SAME: (%[[D:.*]]: !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>) {		// CHECK-SAME: (%[[D:.*]]: !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>) {
func @gpu_wmma_store_op(%arg0 : !gpu.mma_matrix<16x16xf16, "DOp">) -> () {		func @gpu_wmma_store_op(%arg0 : !gpu.mma_matrix<16x16xf16, "COp">) -> () {
%sg = memref.alloca(){alignment = 32} : memref<32x32xf16, 3>		%sg = memref.alloca(){alignment = 32} : memref<32x32xf16, 3>
%i = constant 16 : index		%i = constant 16 : index
%j = constant 16 : index		%j = constant 16 : index
gpu.subgroup_mma_store_matrix %arg0, %sg[%i,%j] {leadDimension= 32 : index} : !gpu.mma_matrix<16x16xf16, "DOp">, memref<32x32xf16, 3>		gpu.subgroup_mma_store_matrix %arg0, %sg[%i,%j] {leadDimension= 32 : index} : !gpu.mma_matrix<16x16xf16, "COp">, memref<32x32xf16, 3>
// CHECK: %[[INX:.*]] = llvm.mlir.constant(16 : index) : i32		// CHECK: %[[INX:.*]] = llvm.mlir.constant(16 : index) : i32
// CHECK: %{{.*}} = llvm.insertvalue %{{.*}}, %{{.*}}[{{.*}}, {{.*}}]		// CHECK: %{{.*}} = llvm.insertvalue %{{.*}}, %{{.*}}[{{.*}}, {{.*}}]
// CHECK: %[[LDM:.*]] = llvm.mlir.constant(32 : index) : i32		// CHECK: %[[LDM:.*]] = llvm.mlir.constant(32 : index) : i32
// CHECK: %[[LI:.*]] = llvm.mul %[[LDM]], %[[INX]] : i32		// CHECK: %[[LI:.*]] = llvm.mul %[[LDM]], %[[INX]] : i32
// CHECK: %[[LIJ:.*]] = llvm.add %[[LI]], %[[INX]] : i32		// CHECK: %[[LIJ:.*]] = llvm.add %[[LI]], %[[INX]] : i32
// CHECK: %[[OFFSET:.*]] = llvm.extractvalue %17[2] : !llvm.struct<(ptr<f16, 3>, ptr<f16, 3>, i32, array<2 x i32>, array<2 x i32>)>		// CHECK: %[[OFFSET:.*]] = llvm.extractvalue %17[2] : !llvm.struct<(ptr<f16, 3>, ptr<f16, 3>, i32, array<2 x i32>, array<2 x i32>)>
// CHECK: %[[LIJO:.*]] = llvm.add %[[LIJ]], %[[OFFSET]] : i32		// CHECK: %[[LIJO:.*]] = llvm.add %[[LIJ]], %[[OFFSET]] : i32
// CHECK: %[[BASE:.*]] = llvm.extractvalue %17[1] : !llvm.struct<(ptr<f16, 3>, ptr<f16, 3>, i32, array<2 x i32>, array<2 x i32>)>		// CHECK: %[[BASE:.*]] = llvm.extractvalue %17[1] : !llvm.struct<(ptr<f16, 3>, ptr<f16, 3>, i32, array<2 x i32>, array<2 x i32>)>
[9 lines not shown]	gpu.module @test_module {
}		}
}		}

// -----		// -----

gpu.module @test_module {		gpu.module @test_module {

// CHECK-LABEL: func @gpu_wmma_mma_op		// CHECK-LABEL: func @gpu_wmma_mma_op
// CHECK-SAME: (%[[A:.*]]: !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>, %[[B:.*]]: !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>, %[[C:.*]]: !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>) {		// CHECK-SAME: (%[[A:.*]]: !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>, %[[B:.*]]: !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>, %[[C:.*]]: !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>)
func @gpu_wmma_mma_op(%A : !gpu.mma_matrix<16x16xf16, "AOp">, %B : !gpu.mma_matrix<16x16xf16, "BOp">, %C : !gpu.mma_matrix<16x16xf16, "COp">) -> () {		func @gpu_wmma_mma_op(%A : !gpu.mma_matrix<16x16xf16, "AOp">, %B : !gpu.mma_matrix<16x16xf16, "BOp">, %C : !gpu.mma_matrix<16x16xf16, "COp">) -> (!gpu.mma_matrix<16x16xf16, "COp">) {
%D = gpu.subgroup_mma_compute %A, %B, %C : !gpu.mma_matrix<16x16xf16, "AOp">, !gpu.mma_matrix<16x16xf16, "BOp">, !gpu.mma_matrix<16x16xf16, "COp"> -> !gpu.mma_matrix<16x16xf16, "DOp">		%D = gpu.subgroup_mma_compute %A, %B, %C : !gpu.mma_matrix<16x16xf16, "AOp">, !gpu.mma_matrix<16x16xf16, "BOp"> -> !gpu.mma_matrix<16x16xf16, "COp">
// CHECK: %[[A1:.*]] = llvm.extractvalue %[[A]][0 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>		// CHECK: %[[A1:.*]] = llvm.extractvalue %[[A]][0 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
// CHECK: %[[A2:.*]] = llvm.extractvalue %[[A]][1 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>		// CHECK: %[[A2:.*]] = llvm.extractvalue %[[A]][1 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
// CHECK: %[[A3:.*]] = llvm.extractvalue %[[A]][2 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>		// CHECK: %[[A3:.*]] = llvm.extractvalue %[[A]][2 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
// CHECK: %[[A4:.*]] = llvm.extractvalue %[[A]][3 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>		// CHECK: %[[A4:.*]] = llvm.extractvalue %[[A]][3 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
// CHECK: %[[A5:.*]] = llvm.extractvalue %[[A]][4 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>		// CHECK: %[[A5:.*]] = llvm.extractvalue %[[A]][4 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
// CHECK: %[[A6:.*]] = llvm.extractvalue %[[A]][5 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>		// CHECK: %[[A6:.*]] = llvm.extractvalue %[[A]][5 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
// CHECK: %[[A7:.*]] = llvm.extractvalue %[[A]][6 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>		// CHECK: %[[A7:.*]] = llvm.extractvalue %[[A]][6 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
// CHECK: %[[A8:.*]] = llvm.extractvalue %[[A]][7 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>		// CHECK: %[[A8:.*]] = llvm.extractvalue %[[A]][7 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
// CHECK: %[[B1:.*]] = llvm.extractvalue %[[B]][0 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>		// CHECK: %[[B1:.*]] = llvm.extractvalue %[[B]][0 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
// CHECK: %[[B2:.*]] = llvm.extractvalue %[[B]][1 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>		// CHECK: %[[B2:.*]] = llvm.extractvalue %[[B]][1 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
// CHECK: %[[B3:.*]] = llvm.extractvalue %[[B]][2 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>		// CHECK: %[[B3:.*]] = llvm.extractvalue %[[B]][2 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
// CHECK: %[[B4:.*]] = llvm.extractvalue %[[B]][3 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>		// CHECK: %[[B4:.*]] = llvm.extractvalue %[[B]][3 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
// CHECK: %[[B5:.*]] = llvm.extractvalue %[[B]][4 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>		// CHECK: %[[B5:.*]] = llvm.extractvalue %[[B]][4 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
// CHECK: %[[B6:.*]] = llvm.extractvalue %[[B]][5 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>		// CHECK: %[[B6:.*]] = llvm.extractvalue %[[B]][5 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
// CHECK: %[[B7:.*]] = llvm.extractvalue %[[B]][6 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>		// CHECK: %[[B7:.*]] = llvm.extractvalue %[[B]][6 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
// CHECK: %[[B8:.*]] = llvm.extractvalue %[[B]][7 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>		// CHECK: %[[B8:.*]] = llvm.extractvalue %[[B]][7 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
// CHECK: %[[C1:.*]] = llvm.extractvalue %[[C]][0 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>		// CHECK: %[[C1:.*]] = llvm.extractvalue %[[C]][0 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
// CHECK: %[[C2:.*]] = llvm.extractvalue %[[C]][1 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>		// CHECK: %[[C2:.*]] = llvm.extractvalue %[[C]][1 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
// CHECK: %[[C3:.*]] = llvm.extractvalue %[[C]][2 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>		// CHECK: %[[C3:.*]] = llvm.extractvalue %[[C]][2 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
// CHECK: %[[C4:.*]] = llvm.extractvalue %[[C]][3 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>		// CHECK: %[[C4:.*]] = llvm.extractvalue %[[C]][3 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
// CHECK: %{{.*}} = nvvm.wmma.m16n16k16.mma.row.row.f16.f16 %[[A1]], %[[A2]], %[[A3]], %[[A4]], %[[A5]], %[[A6]], %[[A7]], %[[A8]], %[[B1]], %[[B2]], %[[B3]], %[[B4]], %[[B5]], %[[B6]], %[[B7]], %[[B8]], %[[C1]], %[[C2]], %[[C3]], %[[C4]] : vector<2xf16> -> !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>		// CHECK: %[[RES:.*]] = nvvm.wmma.m16n16k16.mma.row.row.f16.f16 %[[A1]], %[[A2]], %[[A3]], %[[A4]], %[[A5]], %[[A6]], %[[A7]], %[[A8]], %[[B1]], %[[B2]], %[[B3]], %[[B4]], %[[B5]], %[[B6]], %[[B7]], %[[B8]], %[[C1]], %[[C2]], %[[C3]], %[[C4]] : vector<2xf16> -> !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
// CHECK: llvm.return		// CHECK: llvm.return %[[RES]] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
		return %D : !gpu.mma_matrix<16x16xf16, "COp">
		}
		}

		// -----

		gpu.module @test_module {

		// CHECK-LABEL: func @gpu_wmma_mma_loop_op
		// CHECK: %[[C:.+]] = nvvm.wmma.m16n16k16.load.c.f16.row.stride %{{.*}}, %{{.*}} : (!llvm.ptr<i32>, i32) -> !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
		// CHECK: llvm.br ^bb1(%{{.*}}, %[[C]] : i32, !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>)
		// CHECK: ^bb1(%{{.*}}: i32, %[[ACC:.+]]: !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>): // 2 preds: ^bb0, ^bb2
		// CHECK: llvm.cond_br %{{.*}}, ^bb2, ^bb3
		// CHECK: ^bb2: // pred: ^bb1
		// CHECK: %[[A:.+]] = nvvm.wmma.m16n16k16.load.a.f16.row.stride %{{.*}}, %{{.*}} : (!llvm.ptr<i32>, i32) -> !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
		// CHECK: %[[B:.+]] = nvvm.wmma.m16n16k16.load.b.f16.row.stride %{{.*}}, %{{.*}} : (!llvm.ptr<i32>, i32) -> !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
		// CHECK: %[[A0:.+]] = llvm.extractvalue %[[A]][0 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
		// CHECK: %[[A1:.+]] = llvm.extractvalue %[[A]][1 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
		// CHECK: %[[A2:.+]] = llvm.extractvalue %[[A]][2 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
		// CHECK: %[[A3:.+]] = llvm.extractvalue %[[A]][3 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
		// CHECK: %[[A4:.+]] = llvm.extractvalue %[[A]][4 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
		// CHECK: %[[A5:.+]] = llvm.extractvalue %[[A]][5 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
		// CHECK: %[[A6:.+]] = llvm.extractvalue %[[A]][6 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
		// CHECK: %[[A7:.+]] = llvm.extractvalue %[[A]][7 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
		// CHECK: %[[B0:.+]] = llvm.extractvalue %[[B]][0 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
		// CHECK: %[[B1:.+]] = llvm.extractvalue %[[B]][1 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
		// CHECK: %[[B2:.+]] = llvm.extractvalue %[[B]][2 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
		// CHECK: %[[B3:.+]] = llvm.extractvalue %[[B]][3 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
		// CHECK: %[[B4:.+]] = llvm.extractvalue %[[B]][4 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
		// CHECK: %[[B5:.+]] = llvm.extractvalue %[[B]][5 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
		// CHECK: %[[B6:.+]] = llvm.extractvalue %[[B]][6 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
		// CHECK: %[[B7:.+]] = llvm.extractvalue %[[B]][7 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
		// CHECK: %[[ACC0:.+]] = llvm.extractvalue %[[ACC]][0 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
		// CHECK: %[[ACC1:.+]] = llvm.extractvalue %[[ACC]][1 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
		// CHECK: %[[ACC2:.+]] = llvm.extractvalue %[[ACC]][2 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
		// CHECK: %[[ACC3:.+]] = llvm.extractvalue %[[ACC]][3 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
		// CHECK: %[[ACC_MUL:.+]] = nvvm.wmma.m16n16k16.mma.row.row.f16.f16 %[[A0]], %[[A1]], %[[A2]], %[[A3]], %[[A4]], %[[A5]], %[[A6]], %[[A7]], %[[B0]], %[[B1]], %[[B2]], %[[B3]], %[[B4]], %[[B5]], %[[B6]], %[[B7]], %[[ACC0]], %[[ACC1]], %[[ACC2]], %[[ACC3]] : vector<2xf16> -> !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
		// CHECK: llvm.br ^bb1(%{{.*}}, %[[ACC_MUL]] : i32, !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>)
		// CHECK: ^bb3: // pred: ^bb1
		// CHECK: %[[E0:.+]] = llvm.extractvalue %[[ACC]][0 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
		// CHECK: %[[E1:.+]] = llvm.extractvalue %[[ACC]][1 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
		// CHECK: %[[E2:.+]] = llvm.extractvalue %[[ACC]][2 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
		// CHECK: %[[E3:.+]] = llvm.extractvalue %[[ACC]][3 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
		// CHECK: nvvm.wmma.m16n16k16.store.d.f16.row.stride %{{.*}}, %[[E0]], %[[E1]], %[[E2]], %[[E3]], %{{.*}} : !llvm.ptr<i32>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, i32

		func @gpu_wmma_mma_loop_op(%arg0: memref<128x128xf16>, %arg1: memref<128x128xf16>, %arg2: memref<128x128xf16>) {
		%c0 = constant 0 : index
		%c128 = constant 128 : index
		%c32 = constant 32 : index
		%0 = gpu.subgroup_mma_load_matrix %arg2[%c0, %c0] {leadDimension = 128 : index} : memref<128x128xf16> -> !gpu.mma_matrix<16x16xf16, "COp">
		br ^bb1(%c0, %0 : index, !gpu.mma_matrix<16x16xf16, "COp">)
		^bb1(%1: index, %2: !gpu.mma_matrix<16x16xf16, "COp">): // 2 preds: ^bb0, ^bb2
		%3 = cmpi slt, %1, %c128 : index
		cond_br %3, ^bb2, ^bb3
		^bb2: // pred: ^bb1
		%4 = gpu.subgroup_mma_load_matrix %arg0[%c0, %1] {leadDimension = 128 : index} : memref<128x128xf16> -> !gpu.mma_matrix<16x16xf16, "AOp">
		%5 = gpu.subgroup_mma_load_matrix %arg1[%1, %c0] {leadDimension = 128 : index} : memref<128x128xf16> -> !gpu.mma_matrix<16x16xf16, "BOp">
		%6 = gpu.subgroup_mma_compute %4, %5, %2 : !gpu.mma_matrix<16x16xf16, "AOp">, !gpu.mma_matrix<16x16xf16, "BOp"> -> !gpu.mma_matrix<16x16xf16, "COp">
		%7 = addi %1, %c32 : index
		br ^bb1(%7, %6 : index, !gpu.mma_matrix<16x16xf16, "COp">)
		^bb3: // pred: ^bb1
		gpu.subgroup_mma_store_matrix %2, %arg2[%c0, %c0] {leadDimension = 128 : index} : !gpu.mma_matrix<16x16xf16, "COp">, memref<128x128xf16>
return		return
}		}
}		}

mlir/test/Dialect/GPU/invalid.mlir

(468 lines not shown)	func @mmamatrix_invalid_shape(){
return		return
}		}

// -----		// -----

func @mmamatrix_operand_type(){		func @mmamatrix_operand_type(){
%wg = memref.alloca() {alignment = 32} : memref<32x32xf16, 3>		%wg = memref.alloca() {alignment = 32} : memref<32x32xf16, 3>
%i = constant 16 : index		%i = constant 16 : index
// expected-error @+1 {{operand expected to be one of AOp, BOp, COp or DOp}}		// expected-error @+1 {{operand expected to be one of AOp, BOp or COp}}
%0 = gpu.subgroup_mma_load_matrix %wg[%i, %i] {leadDimension = 32 : index} : memref<32x32xf16, 3> -> !gpu.mma_matrix<16x16xf16, "EOp">		%0 = gpu.subgroup_mma_load_matrix %wg[%i, %i] {leadDimension = 32 : index} : memref<32x32xf16, 3> -> !gpu.mma_matrix<16x16xf16, "EOp">
return		return
}		}

// -----		// -----

func @mmamatrix_invalid_element_type(){		func @mmamatrix_invalid_element_type(){
%wg = memref.alloca() {alignment = 32} : memref<32x32xf16, 3>		%wg = memref.alloca() {alignment = 32} : memref<32x32xf16, 3>
(22 lines not shown)	func @mmaLoadOp_invalid_mem_space(){
%i = constant 16 : index		%i = constant 16 : index
// expected-error @+1 {{source memorySpace kGenericMemorySpace, kSharedMemorySpace or kGlobalMemorySpace only allowed}}		// expected-error @+1 {{source memorySpace kGenericMemorySpace, kSharedMemorySpace or kGlobalMemorySpace only allowed}}
%0 = gpu.subgroup_mma_load_matrix %wg[%i, %i] {leadDimension = 32 : index} : memref<32x32xf16, 5> -> !gpu.mma_matrix<16x16xf16, "AOp">		%0 = gpu.subgroup_mma_load_matrix %wg[%i, %i] {leadDimension = 32 : index} : memref<32x32xf16, 5> -> !gpu.mma_matrix<16x16xf16, "AOp">
return		return
}		}

// -----		// -----

func @mmaLoadOp_operand_type(){
%wg = memref.alloca() {alignment = 32} : memref<32x32xf16, 3>
%i = constant 16 : index
// expected-error @+1 {{only AOp, BOp and COp can be loaded}}
%0 = gpu.subgroup_mma_load_matrix %wg[%i, %i] {leadDimension = 32 : index} : memref<32x32xf16, 3> -> !gpu.mma_matrix<16x16xf16, "DOp">
return
}

// -----

#layout_map_col_major = affine_map<(i, j) -> (j, i)>		#layout_map_col_major = affine_map<(i, j) -> (j, i)>

func @wmmaStoreOp_invalid_map(%arg0 : !gpu.mma_matrix<16x16xf16, "DOp">) -> () {		func @wmmaStoreOp_invalid_map(%arg0 : !gpu.mma_matrix<16x16xf16, "COp">) -> () {
%sg = memref.alloca(){alignment = 32} : memref<32x32xf16, #layout_map_col_major, 3>		%sg = memref.alloca(){alignment = 32} : memref<32x32xf16, #layout_map_col_major, 3>
%i = constant 16 : index		%i = constant 16 : index
%j = constant 16 : index		%j = constant 16 : index
// expected-error @+1 {{expected identity layout map for destination memref}}		// expected-error @+1 {{expected identity layout map for destination memref}}
gpu.subgroup_mma_store_matrix %arg0, %sg[%i,%j] {leadDimension= 32 : index} : !gpu.mma_matrix<16x16xf16, "DOp">, memref<32x32xf16,#layout_map_col_major, 3>		gpu.subgroup_mma_store_matrix %arg0, %sg[%i,%j] {leadDimension= 32 : index} : !gpu.mma_matrix<16x16xf16, "COp">, memref<32x32xf16,#layout_map_col_major, 3>
return		return
}		}

// -----		// -----

func @wmmaStoreOp_invalid_mem_space(%arg0 : !gpu.mma_matrix<16x16xf16, "DOp">) -> () {		func @wmmaStoreOp_invalid_mem_space(%arg0 : !gpu.mma_matrix<16x16xf16, "COp">) -> () {
%sg = memref.alloca(){alignment = 32} : memref<32x32xf16, 5>		%sg = memref.alloca(){alignment = 32} : memref<32x32xf16, 5>
%i = constant 16 : index		%i = constant 16 : index
%j = constant 16 : index		%j = constant 16 : index
// expected-error @+1 {{destination memorySpace of kGenericMemorySpace, kGlobalMemorySpace or kSharedMemorySpace only allowed}}		// expected-error @+1 {{destination memorySpace of kGenericMemorySpace, kGlobalMemorySpace or kSharedMemorySpace only allowed}}
gpu.subgroup_mma_store_matrix %arg0, %sg[%i,%j] {leadDimension= 32 : index} : !gpu.mma_matrix<16x16xf16, "DOp">, memref<32x32xf16, 5>		gpu.subgroup_mma_store_matrix %arg0, %sg[%i,%j] {leadDimension= 32 : index} : !gpu.mma_matrix<16x16xf16, "COp">, memref<32x32xf16, 5>
return		return
}		}

// -----		// -----

func @wmmaStoreOp_invalid_store_operand(%arg0 : !gpu.mma_matrix<16x16xf16, "AOp">) -> () {		func @wmmaStoreOp_invalid_store_operand(%arg0 : !gpu.mma_matrix<16x16xf16, "AOp">) -> () {
%sg = memref.alloca(){alignment = 32} : memref<32x32xf16, 3>		%sg = memref.alloca(){alignment = 32} : memref<32x32xf16, 3>
%i = constant 16 : index		%i = constant 16 : index
%j = constant 16 : index		%j = constant 16 : index
// expected-error @+1 {{expected the operand matrix being stored to have 'DOp' operand type}}		// expected-error @+1 {{expected the operand matrix being stored to have 'COp' operand type}}
gpu.subgroup_mma_store_matrix %arg0, %sg[%i,%j] {leadDimension= 32 : index} : !gpu.mma_matrix<16x16xf16, "AOp">, memref<32x32xf16, 3>		gpu.subgroup_mma_store_matrix %arg0, %sg[%i,%j] {leadDimension= 32 : index} : !gpu.mma_matrix<16x16xf16, "AOp">, memref<32x32xf16, 3>
return		return
}		}

// -----		// -----

func @wmmaMmaOp_invalid_operand_order(%A : !gpu.mma_matrix<16x16xf16, "AOp">, %B : !gpu.mma_matrix<16x16xf16, "BOp">, %C : !gpu.mma_matrix<16x16xf16, "COp">) -> () {		func @wmmaMmaOp_invalid_operand_order(%A : !gpu.mma_matrix<16x16xf16, "AOp">, %B : !gpu.mma_matrix<16x16xf16, "BOp">, %C : !gpu.mma_matrix<16x16xf16, "COp">) -> () {
// expected-error @+1 {{operands must be in the order AOp, BOp, COp}}		// expected-error @+1 {{operands must be in the order AOp, BOp, COp}}
%D = gpu.subgroup_mma_compute %B, %A, %C : !gpu.mma_matrix<16x16xf16, "BOp">, !gpu.mma_matrix<16x16xf16, "AOp">, !gpu.mma_matrix<16x16xf16, "COp"> -> !gpu.mma_matrix<16x16xf16, "DOp">		%D = gpu.subgroup_mma_compute %B, %A, %C : !gpu.mma_matrix<16x16xf16, "BOp">, !gpu.mma_matrix<16x16xf16, "AOp"> -> !gpu.mma_matrix<16x16xf16, "COp">
return		return
}		}

// -----		// -----

func @wmmaMmaOp_invalid_operand_shapes(%A : !gpu.mma_matrix<16x32xf16, "AOp">, %B : !gpu.mma_matrix<16x16xf16, "BOp">, %C : !gpu.mma_matrix<16x16xf16, "COp">) -> () {		func @wmmaMmaOp_invalid_operand_shapes(%A : !gpu.mma_matrix<16x32xf16, "AOp">, %B : !gpu.mma_matrix<16x16xf16, "BOp">, %C : !gpu.mma_matrix<16x16xf16, "COp">) -> () {
// expected-error @+1 {{operand shapes do not satisfy matmul constraints}}		// expected-error @+1 {{operand shapes do not satisfy matmul constraints}}
%D = gpu.subgroup_mma_compute %A, %B, %C : !gpu.mma_matrix<16x32xf16, "AOp">, !gpu.mma_matrix<16x16xf16, "BOp">, !gpu.mma_matrix<16x16xf16, "COp"> -> !gpu.mma_matrix<16x16xf16, "DOp">		%D = gpu.subgroup_mma_compute %A, %B, %C : !gpu.mma_matrix<16x32xf16, "AOp">, !gpu.mma_matrix<16x16xf16, "BOp"> -> !gpu.mma_matrix<16x16xf16, "COp">
return		return
}		}

mlir/test/Integration/GPU/CUDA/TensorCore/wmma-matmul-f16.mlir

(76 lines not shown)	module attributes {gpu.container_module} {
gpu.module @main_kernel {		gpu.module @main_kernel {
gpu.func @main_kernel(%arg0: memref<16x16xf16>, %arg22 : memref<16x16xf16>) kernel {		gpu.func @main_kernel(%arg0: memref<16x16xf16>, %arg22 : memref<16x16xf16>) kernel {
%c0 = constant 0 : index		%c0 = constant 0 : index

%0 = gpu.subgroup_mma_load_matrix %arg0[%c0, %c0] {operand = "AOp", leadDimension = 16 : index} : memref<16x16xf16> -> !gpu.mma_matrix<16x16xf16, "AOp">		%0 = gpu.subgroup_mma_load_matrix %arg0[%c0, %c0] {operand = "AOp", leadDimension = 16 : index} : memref<16x16xf16> -> !gpu.mma_matrix<16x16xf16, "AOp">
%1 = gpu.subgroup_mma_load_matrix %arg0[%c0, %c0] {operand = "BOp", leadDimension = 16 : index} : memref<16x16xf16> -> !gpu.mma_matrix<16x16xf16, "BOp">		%1 = gpu.subgroup_mma_load_matrix %arg0[%c0, %c0] {operand = "BOp", leadDimension = 16 : index} : memref<16x16xf16> -> !gpu.mma_matrix<16x16xf16, "BOp">
%2 = gpu.subgroup_mma_load_matrix %arg22[%c0, %c0] {operand = "COp", leadDimension = 16 : index} : memref<16x16xf16> -> !gpu.mma_matrix<16x16xf16, "COp">		%2 = gpu.subgroup_mma_load_matrix %arg22[%c0, %c0] {operand = "COp", leadDimension = 16 : index} : memref<16x16xf16> -> !gpu.mma_matrix<16x16xf16, "COp">

%3 = gpu.subgroup_mma_compute %0, %1, %2 : !gpu.mma_matrix<16x16xf16, "AOp">, !gpu.mma_matrix<16x16xf16, "BOp">, !gpu.mma_matrix<16x16xf16, "COp"> -> !gpu.mma_matrix<16x16xf16, "DOp">		%3 = gpu.subgroup_mma_compute %0, %1, %2 : !gpu.mma_matrix<16x16xf16, "AOp">, !gpu.mma_matrix<16x16xf16, "BOp"> -> !gpu.mma_matrix<16x16xf16, "COp">

gpu.subgroup_mma_store_matrix %3, %arg0[%c0, %c0] {leadDimension = 16 : index}: !gpu.mma_matrix<16x16xf16, "DOp">, memref<16x16xf16>		gpu.subgroup_mma_store_matrix %3, %arg0[%c0, %c0] {leadDimension = 16 : index}: !gpu.mma_matrix<16x16xf16, "COp">, memref<16x16xf16>

gpu.return		gpu.return
}		}
}		}

func private @print_memref_f32(memref<*xf32>)		func private @print_memref_f32(memref<*xf32>)
}		}

mlir/test/Integration/GPU/CUDA/TensorCore/wmma-matmul-f32.mlir

(67 lines not shown)	module attributes {gpu.container_module} {
gpu.module @main_kernel {		gpu.module @main_kernel {
gpu.func @main_kernel(%arg0: memref<16x16xf16>, %arg22 : memref<16x16xf32>) kernel {		gpu.func @main_kernel(%arg0: memref<16x16xf16>, %arg22 : memref<16x16xf32>) kernel {
%c0 = constant 0 : index		%c0 = constant 0 : index

%0 = gpu.subgroup_mma_load_matrix %arg0[%c0, %c0] {operand = "AOp", leadDimension = 16 : index} : memref<16x16xf16> -> !gpu.mma_matrix<16x16xf16, "AOp">		%0 = gpu.subgroup_mma_load_matrix %arg0[%c0, %c0] {operand = "AOp", leadDimension = 16 : index} : memref<16x16xf16> -> !gpu.mma_matrix<16x16xf16, "AOp">
%1 = gpu.subgroup_mma_load_matrix %arg0[%c0, %c0] {operand = "BOp", leadDimension = 16 : index} : memref<16x16xf16> -> !gpu.mma_matrix<16x16xf16, "BOp">		%1 = gpu.subgroup_mma_load_matrix %arg0[%c0, %c0] {operand = "BOp", leadDimension = 16 : index} : memref<16x16xf16> -> !gpu.mma_matrix<16x16xf16, "BOp">
%2 = gpu.subgroup_mma_load_matrix %arg22[%c0, %c0] {operand = "COp", leadDimension = 16 : index} : memref<16x16xf32> -> !gpu.mma_matrix<16x16xf32, "COp">		%2 = gpu.subgroup_mma_load_matrix %arg22[%c0, %c0] {operand = "COp", leadDimension = 16 : index} : memref<16x16xf32> -> !gpu.mma_matrix<16x16xf32, "COp">

%3 = gpu.subgroup_mma_compute %0, %1, %2 : !gpu.mma_matrix<16x16xf16, "AOp">, !gpu.mma_matrix<16x16xf16, "BOp">, !gpu.mma_matrix<16x16xf32, "COp"> -> !gpu.mma_matrix<16x16xf32, "DOp">		%3 = gpu.subgroup_mma_compute %0, %1, %2 : !gpu.mma_matrix<16x16xf16, "AOp">, !gpu.mma_matrix<16x16xf16, "BOp"> -> !gpu.mma_matrix<16x16xf32, "COp">

gpu.subgroup_mma_store_matrix %3, %arg22[%c0, %c0] {leadDimension = 16 : index}: !gpu.mma_matrix<16x16xf32, "DOp">, memref<16x16xf32>		gpu.subgroup_mma_store_matrix %3, %arg22[%c0, %c0] {leadDimension = 16 : index}: !gpu.mma_matrix<16x16xf32, "COp">, memref<16x16xf32>

gpu.return		gpu.return
}		}
}		}

func private @print_memref_f32(memref<*xf32>)		func private @print_memref_f32(memref<*xf32>)
}		}


Diff 348302
