This is an archive of the discontinued LLVM Phabricator instance.

Support `transpose` mode for `gpu.subgroup` WMMA ops
ClosedPublic

Authored by navdeepkk on Nov 30 2022, 8:50 AM.

Details

Summary

Add support for loading, computing, and storing gpu.subgroup WMMA ops
in transpose mode. Update the GPU-to-NVVM lowerings to support
transpose mode, and update the integration tests accordingly.
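
For illustration, a rough sketch of how a GPU-to-NVVM lowering can map such transpose flags onto NVVM fragment layouts (a sketch only; the accessor names are hypothetical and need not match this diff):

  // Hypothetical sketch: derive the NVVM MMA fragment layouts from the
  // GPU-level transpose flags when lowering the compute op.
  NVVM::MMALayout aLayout =
      op.getATranspose() ? NVVM::MMALayout::col : NVVM::MMALayout::row;
  NVVM::MMALayout bLayout =
      op.getBTranspose() ? NVVM::MMALayout::col : NVVM::MMALayout::row;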

Diff Detail

Event Timeline

navdeepkk created this revision. Nov 30 2022, 8:50 AM
navdeepkk requested review of this revision. Nov 30 2022, 8:50 AM
navdeepkk updated this revision to Diff 478995. Nov 30 2022, 9:45 AM

Clang-format

ThomasRaoux added inline comments.
mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
1162

Please add some doc (or move it from below).

1182–1188

Do you understand why the compute operation has to know that? The transposition should happen during the load; I don't understand why we need to know it at that point. It looks like in your test you don't set it and it works fine.
Maybe @christopherbate would know?

1231–1232

This doc seems to be in the wrong spot?

mlir/test/Conversion/GPUToNVVM/wmma-ops-to-nvvm.mlir
13

We should also test the non-transposed case. Is it somewhere?

ThomasRaoux requested changes to this revision. Nov 30 2022, 10:28 AM
ThomasRaoux added inline comments.
mlir/lib/Conversion/GPUToNVVM/WmmaOpsToNvvm.cpp
156

Please also change the lowering to SPIR-V to avoid a miscompile (it's fine to fail the lowering pattern, but silently ignoring the new field is wrong).
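
For example, something along these lines in the SPIR-V pattern (a sketch; getTranspose is a hypothetical accessor name):

  // Fail the pattern explicitly instead of silently dropping the new field.
  if (op.getTranspose())
    return rewriter.notifyMatchFailure(op, "transpose is not supported");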

This revision now requires changes to proceed. Nov 30 2022, 10:28 AM
bondhugula added inline comments. Nov 30 2022, 5:43 PM
mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
1182–1188

I didn't understand this question. Please see the NVVM lowering below. Why would the LLVM/NVVM builder take aLayout and bLayout in that case?

rewriter.replaceOpWithNewOp<NVVM::WMMAMmaOp>(
-      op, adaptor.getOpC().getType(), m, n, k, layout, layout, sourceType,
+      op, adaptor.getOpC().getType(), m, n, k, aLayout, bLayout, sourceType,
       destType, unpackedOps);

It looks like the mma intrinsic allows you to multiply elements in the transposed order and this is orthogonal to the load. So, you can load in whatever way, but do the multiplication on the transpose of the values if desired?

1228–1232

Doc added in the wrong place.

mlir/test/Conversion/GPUToNVVM/wmma-ops-to-nvvm.mlir
96

As a convention, attributes like these should use snake_case: a_transpose?

mlir/test/Integration/GPU/CUDA/TensorCore/wmma-matmul-f16.mlir
30

How is this change related/needed? The comment above would now be inaccurate.

ThomasRaoux added inline comments. Nov 30 2022, 5:51 PM
mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
1182–1188

> It looks like the mma intrinsic allows you to multiply elements in the transposed order and this is orthogonal to the load. So, you can load in whatever way, but do the multiplication on the transpose of the values if desired?

From the PTX spec, this doesn't sound like what it is:

> The qualifiers .alayout and .blayout must match the layout specified on the wmma.load instructions that produce the contents of operands a and b respectively. Similarly, the qualifiers .atype, .btype and .ctype must match the corresponding qualifiers on the wmma.load instructions that produce the contents of operands a, b and c respectively.

https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-wmma-mma

I'm guessing this is a way to allow a more flexible implementation of WMMA, but it makes the representation awkward. Interestingly enough, the SPIR-V equivalent doesn't have this restriction. This is why I wanted to understand whether we can simplify the design.

bondhugula added inline comments. Nov 30 2022, 5:58 PM
mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
1182–1188

Shouldn't the LLVM (and NVVM) dialects mimic the underlying intrinsics 1:1? If the underlying intrinsic expects it and requires it to be matched, the MLIR counterpart shouldn't remove it but should expose the same requirements on the attributes. This should also be documented.

ThomasRaoux added inline comments. Nov 30 2022, 6:18 PM
mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
1182–1188

The LLVM and NVVM dialects should mimic the intrinsics 1:1, but the GPU dialect can try to abstract away those API-specific details if possible.

Anyway, I don't have a good solution, so it is fine to leave as is if nobody has a better idea; however, the current approach feels very error-prone.

navdeepkk updated this revision to Diff 479141. Nov 30 2022, 7:34 PM
navdeepkk marked 7 inline comments as done.

Address comments

navdeepkk updated this revision to Diff 479147. Nov 30 2022, 8:15 PM
navdeepkk marked 5 inline comments as done.

Update SPIR-V lowerings.

ThomasRaoux accepted this revision. Dec 1 2022, 9:07 AM
This revision is now accepted and ready to land. Dec 1 2022, 9:07 AM

Several minor improvements are needed on the readability side.

mlir/lib/Conversion/GPUToNVVM/WmmaOpsToNvvm.cpp
79–80

Please use a ternary operator; it's more readable and compact.
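
For example (a sketch with placeholder names, not the actual code in this file):

  // The suggested ternary form:
  auto frag = isTranspose ? colMajorFrag : rowMajorFrag;

instead of an if/else block.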

156

Likewise.

227–233

Likewise.

mlir/lib/Conversion/GPUToSPIRV/WmmaOpsToSPIRV.cpp
119–120

You can fix the typo in the variable name while you're at it.

120

useColMajor -> isColumnMajor, to be consistent with the one below.

mlir/lib/Conversion/VectorToGPU/VectorToGPU.cpp
476–478

Use the `/*arg=..*/` style for the last one.
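
That is, LLVM's inline argument-comment convention, e.g. (a hypothetical call):

  // Name the trailing argument inline at the call site:
  createLoadOp(builder, loc, /*isTranspose=*/false);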

804

Likewise.

mlir/test/Integration/GPU/CUDA/TensorCore/wmma-matmul-f16.mlir
25

with the column index

navdeepkk updated this revision to Diff 479566. Dec 2 2022, 2:33 AM
navdeepkk marked 8 inline comments as done.

Address comments

navdeepkk updated this revision to Diff 479820. Dec 3 2022, 2:59 AM

Remove whitespace errors.

navdeepkk updated this revision to Diff 479973. Dec 4 2022, 9:04 PM

Sync with upstream