The following support is added:
1.) Lowering for the GPU WMMA load op for AOp, BOp, and COp. The lowering supports transposed and non-transposed loads for AOp and BOp. Only non-transposed loads are supported for COp. Loading for COp also supports the opSelect bit.
2.) Lowering for the GPU WMMA mma op with support for the opSelect bit.
3.) Lowering for the GPU WMMA store op with support for the opSelect bit.
- In order to maintain good abstractions, could you refactor this into a GPUToAMDGPU pass instead, targeting the amdgpu.wmma operation, which will then be lowered to rocdl? This should make the code less fragile and mean that we only have one place to change if we need to adjust the compiler intrinsics in the future.
- https://reviews.llvm.org/D154666 defines a lowering for lane_id, please use it
- Minor comments on test organization
I'm willing to drop point 1 here if you're particularly opposed to it
Could this be a derived flag that defaults to "ROCm integration tests enabled and chipset is known to support WMMA"?
Please don't hardcode the chip in here and use %chip instead like in the other ROCm integration tests.
Thanks for the review. I'll work on addressing 2 and 3. Regarding point 1: since the GPU dialect is supposed to serve as a common abstraction for both AMD and Nvidia GPUs, we can just lower it to the ROCDL intrinsics. I am not sure the AMD GPU intrinsics will deviate significantly from the GPU dialect ops already present. If we decide to go from the GPU dialect to the AMDGPU dialect intrinsics, and the AMDGPU dialect intrinsics capture more information, we would have to represent that information in the GPU dialect ops somehow as well. Or we could skip the GPU dialect completely, which I don't think would be a good idea, as the frontends would need two different mechanisms (possibly sharing significant code) to map to AMD or Nvidia GPUs. The current patch enables users to just generate GPU dialect ops and lower them to AMD or Nvidia GPUs with the switch of a single pass. Please let me know if there is a stronger motivation or if I am missing something.
Re 1: The motivation is that, while the rocdl dialect provides direct access to the LLVM intrinsics, the amdgpu dialect provides a wrapper around those intrinsics that is more MLIR-flavored. This means that there is only one place in the code that needs to know the exact details of WMMA and any relevant type mangling (ex. the fact that i8 inputs should be turned into i64s).
I would like to leverage this abstraction when lowering from the GPU dialect.
That is, the GPU dialect abstracts over the amdgpu and rocdl dialects.
(Having GPU-to-AMDGPU would also make selecting MFMA vs WMMA vs ... easier)
Thanks. This sheds more light on this. I am trying to understand whether the GPU dialect ops themselves can capture all the information (via attributes) that is needed to lower them to the ROCDL dialect ops (for both WMMA and MFMA). If that is the case, then the lowering from the GPU dialect to ROCDL will be the single place where all the information/logic resides. If it is not possible to represent all of the WMMA and MFMA functionality in the GPU dialect ops, then we can refactor this patch into two stages. Another point is that there is no intrinsic for the load and store ops. Will we even want them to be present in the AMDGPU dialect, or should we just keep the GPU dialect ops and lower them to ROCDL/LLVM transparently?
On a separate note, AFAIK there is no path to lower MFMA ops from the GPU dialect to AMDGPU currently. Do we have a plan to add that support, or are users supposed to use AMDGPU dialect ops directly?
The reason I want to lower the matrix ops from GPU to AMDGPU + memref/vector + ... instead of to ROCDL and LLVM directly is separation of concerns and progressive lowering.
gpu-to-amdgpu lowers GPU matrix operations to MLIR constructs that abstract away some of the fiddlier details of the underlying intrinsics (such as how F16 * F16 => F16 doesn't return a vector of F16 (with half its entries updated) but instead a vector of 32-bit values with the low or high halves updated).
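To make that packing detail concrete, here's a small Python sketch (not the actual lowering code; the function name and register model are invented for illustration) of how an f16 result lands in one half of each 32-bit accumulator register, selected by opSelect:

```python
def write_f16_accumulator(acc_words, f16_halves, op_select):
    """Model of the f16*f16=>f16 WMMA result layout: each lane's result
    register is 32 bits wide, and opSelect picks which 16-bit half
    receives the f16 value; the other half is preserved."""
    out = []
    for word, h in zip(acc_words, f16_halves):
        if op_select:
            out.append((word & 0x0000FFFF) | ((h & 0xFFFF) << 16))  # write high half
        else:
            out.append((word & 0xFFFF0000) | (h & 0xFFFF))          # write low half
    return out
```

The point of the two-stage lowering is that only amdgpu-to-rocdl has to know about this layout; gpu-to-amdgpu works at the level of whole matrix values.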
Going straight from GPU to ROCDL means you have to copy all the complexity that's present in amdgpu-to-rocdl, and I don't want to maintain two copies of the same code if I don't have to, hence my request to do a two-stage lowering.
And re load and store: unlike on Nvidia, WMMA is just an instruction; loads and stores should be lowered to operations on a private memory attribution (aka an alloca()) that replaces each GPU matrix value.
As to MFMA, see @sjw36 for more there, since he's been poking at that lowering.
I think this is headed in a good direction, I just have some thoughts, mostly about how the pass options were set up
I don't want opSelect here, though I'm willing to allow warpSize so long as that becomes a module attribute that SerializeToHsaco picks up on for correct device library linking/option setting.
Anything that's requiring reasoning about opSelect should be moved up to gpu-to-amdgpu.
Heck, you probably can and should implement the case where you use a sequence of two WMMA ops, one writing low halves and one writing high halves, to do a larger matrix multiplication that produces fp16 values.
I'm not convinced this needs to be here.
You could implement a larger subgroup MMA operation that uses both halves of the output registers.
Alternatively, you could make amdgpu.opSelect an attribute on the relevant operations instead of making it a pass-wide knob.
This does make some sense to have here, yeah
Not sure we'll need these?
(To clarify, this could be two compute ops, one with opSelect = false and one with opSelect = true)
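A Python sketch of what merging the two compute ops' results would look like (function name invented for illustration; this just models the register halves, not the actual codegen): the opSelect=false op leaves its f16 results in the low halves, the opSelect=true op in the high halves, so each 32-bit register ends up carrying two independent f16 results.

```python
def combine_two_wmma_results(low_words, high_words):
    """Merge the accumulator registers of two WMMA compute ops:
    one run with opSelect=false (results in low 16-bit halves) and
    one with opSelect=true (results in high 16-bit halves)."""
    return [(hi & 0xFFFF0000) | (lo & 0x0000FFFF)
            for lo, hi in zip(low_words, high_words)]
```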
Do we need this here?
(also, if you'll have this here, you'll want the data layout incantations somewhere in the pass)
I'm not sure why these are *LaneFirst functions. Could you expand on that?
Grammar nit here, perhaps "only size 32 wavefronts are supported"
That's too strong a restriction - WMMA is available on all the gfx11NN GPUs
There's something suspicious about this being here
Yeah, let's loosen this up and have it check for prefix gfx11 or something like that.
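The loosened check could look something like this Python sketch (the function name is invented; the actual patch is C++, but the predicate is the same):

```python
def chipset_supports_wmma(chipset):
    """Accept any gfx11xx chipset rather than hardcoding gfx1100,
    since WMMA is available across the gfx11 series."""
    return chipset.startswith("gfx11")
```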
This should be a user-configurable option?
Thanks, I wasn't aware of this use case. I was under the impression that it was the job of the underlying register allocator to utilize the upper and lower halves of the registers properly. But if that isn't the case, then it makes sense to have this as an op attribute and just generate code with the opSelect attribute set and unset.
These are actually needed currently to check some conditions in the lowerings.
This can be dropped actually. Thanks.
It's just a naming convention I used, as the laneId is the innermost position of the indexing calculation. I can rename it if it's too confusing.
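To illustrate the naming (a sketch with invented names, not the code from the patch): "lane first" here just means the lane id is the innermost, fastest-varying term of the linearized index, so consecutive lanes of a wave address consecutive elements.

```python
def lane_first_offset(row, row_stride, lane_id):
    """Linearized element offset with laneId in the innermost position:
    lanes 0..N-1 of a wave touch consecutive elements within a row."""
    return row * row_stride + lane_id
```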
Are the thread data mappings the same for them too? Can you please point me to a reference? This patch has only been tested on gfx1100, and can be extended/tested later.
I have relaxed the checks in the passes. I think this variable has to have the complete name. If gfx1100 is the minimum series that supports WMMA, then this will not be problematic, I think. Code generated for this will work on any GFX11-series card, right?