This is an archive of the discontinued LLVM Phabricator instance.

Add Lowerings for GPU WMMA F16/F32 ops to ROCDL dialect
Needs Review · Public

Authored by navdeepkk on Aug 6 2023, 6:39 AM.

Details

Summary

The following support is added:
1.) Lowering for the GPU WMMA load op for AOp, BOp, and COp. The lowering supports transposed and non-transposed loads for AOp and BOp; only non-transposed loads are supported for COp. COp loads also support the opSelect bit. (See the sketch below.)
2.) Lowering for the GPU WMMA mma op with support for the opSelect bit.
3.) Lowering for the GPU WMMA store op with support for the opSelect bit.
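
For reference, a minimal sketch of the GPU dialect ops this patch handles. Shapes, the leadDimension values, and the placement of the `transpose` attribute are illustrative assumptions, not taken from the patch itself:

```mlir
func.func @wmma_f32_16x16x16_f16(%matA: memref<16x16xf16>,
                                 %matB: memref<16x16xf16>,
                                 %matC: memref<16x16xf32>) {
  %c0 = arith.constant 0 : index
  // Non-transposed AOp load.
  %a = gpu.subgroup_mma_load_matrix %matA[%c0, %c0] {leadDimension = 16 : index}
      : memref<16x16xf16> -> !gpu.mma_matrix<16x16xf16, "AOp">
  // Transposed BOp load (assuming the unit attribute is spelled `transpose`).
  %b = gpu.subgroup_mma_load_matrix %matB[%c0, %c0]
      {leadDimension = 16 : index, transpose}
      : memref<16x16xf16> -> !gpu.mma_matrix<16x16xf16, "BOp">
  // COp load: only the non-transposed form is supported.
  %c = gpu.subgroup_mma_load_matrix %matC[%c0, %c0] {leadDimension = 16 : index}
      : memref<16x16xf32> -> !gpu.mma_matrix<16x16xf32, "COp">
  %d = gpu.subgroup_mma_compute %a, %b, %c
      : !gpu.mma_matrix<16x16xf16, "AOp">, !gpu.mma_matrix<16x16xf16, "BOp">
      -> !gpu.mma_matrix<16x16xf32, "COp">
  gpu.subgroup_mma_store_matrix %d, %matC[%c0, %c0] {leadDimension = 16 : index}
      : !gpu.mma_matrix<16x16xf32, "COp">, memref<16x16xf32>
  return
}
```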

Diff Detail

Event Timeline

navdeepkk created this revision. Aug 6 2023, 6:39 AM
Herald added a reviewer: dcaballe.
Herald added a project: Restricted Project.
navdeepkk requested review of this revision. Aug 6 2023, 6:39 AM
navdeepkk edited the summary of this revision. Aug 6 2023, 6:42 AM
krzysz00 requested changes to this revision. Aug 6 2023, 7:52 AM
  1. In order to maintain good abstractions, could you refactor this into a GPUToAMDGPU pass instead, targeting the amdgpu.wmma operation, which will then be lowered to rocdl? This should make the code less fragile and mean that we only have one place to change if we need to adjust the compiler intrinsics in the future.
  2. https://reviews.llvm.org/D154666 defines a lowering for lane_id, please use it
  3. Minor comments on test organization

I'm willing to drop point 1 here if you're particularly opposed to it

mlir/test/CMakeLists.txt
32

Could this be a derived flag that defaults to "ROCm integration tests enabled and chipset is known to support WMMA"?

mlir/test/Integration/GPU/ROCM/WMMA/wmma_f32_16_16_16_f16_a_b_transpose.mlir
5

Please don't hardcode the chip in here and use %chip instead like in the other ROCm integration tests.

This revision now requires changes to proceed. Aug 6 2023, 7:52 AM
> 1. In order to maintain good abstractions, could you refactor this into a GPUToAMDGPU pass instead, targeting the amdgpu.wmma operation, which will then be lowered to rocdl? This should make the code less fragile and mean that we only have one place to change if we need to adjust the compiler intrinsics in the future.
> 2. https://reviews.llvm.org/D154666 defines a lowering for lane_id, please use it
> 3. Minor comments on test organization
>
> I'm willing to drop point 1 here if you're particularly opposed to it

Thanks for the review. I'll work on addressing 2 and 3. Regarding point 1: since the GPU dialect is supposed to serve as a common abstraction for both AMD and Nvidia GPUs, we can lower it directly to the ROCDL intrinsics. I am not sure the AMDGPU dialect ops will deviate significantly from the GPU dialect ops already present. If we go from the GPU dialect to the AMDGPU dialect and the AMDGPU ops capture more information, we would have to carry that information in the GPU dialect ops somehow as well. Alternatively, we could skip the GPU dialect completely, which I don't think is a good idea, because frontends would then need two different mechanisms (likely sharing significant code) to target AMD or Nvidia GPUs. The current patch lets users generate GPU dialect ops and lower them to either AMD or Nvidia GPUs by switching a single pass. Please let me know if there is a stronger motivation or if I am missing something.

Thanks!

Re 1: The motivation is that, while the rocdl dialect provides direct access to the LLVM intrinsics, the amdgpu dialect provides a wrapper around those intrinsics that is more MLIR-flavored. This means that only one place in the code needs to know the exact details of WMMA, such as any relevant type mangling (e.g. the fact that i8 inputs should be turned into i64s).

I would like to leverage this abstraction when lowering from the GPU dialect.

That is, the GPU dialect abstracts over the amdgpu and rocdl dialects.

(Having GPU-to-AMDGPU would also make selecting MFMA vs WMMA vs ... easier)
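
To make the proposed two-stage split concrete, a hedged sketch follows. The assembly formats are written from memory and the exact in-tree syntax may differ:

```mlir
func.func @stage1(%a: vector<16xf16>, %b: vector<16xf16>,
                  %c: vector<8xf32>) -> vector<8xf32> {
  // gpu-to-amdgpu produces the MLIR-flavored wrapper op...
  %d = amdgpu.wmma %a * %b + %c
      : vector<16xf16>, vector<16xf16>, vector<8xf32>
  // ...and amdgpu-to-rocdl later rewrites it into the raw intrinsic, so any
  // type mangling lives in exactly one place, e.g.:
  //   %d = rocdl.wmma.f32.16x16x16.f16 %a, %b, %c
  //       : (vector<16xf16>, vector<16xf16>, vector<8xf32>) -> vector<8xf32>
  return %d : vector<8xf32>
}
```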

> Re 1: The motivation is that, while the rocdl dialect provides direct access to the LLVM intrinsics, the amdgpu dialect provides a wrapper around those intrinsics that is more MLIR-flavored. This means that only one place in the code needs to know the exact details of WMMA, such as any relevant type mangling (e.g. the fact that i8 inputs should be turned into i64s).
>
> I would like to leverage this abstraction when lowering from the GPU dialect.
>
> That is, the GPU dialect abstracts over the amdgpu and rocdl dialects.
>
> (Having GPU-to-AMDGPU would also make selecting MFMA vs WMMA vs ... easier)

Thanks. This sheds more light on this. I am trying to understand whether the GPU dialect ops themselves can capture, via attributes, all the information needed to lower them to ROCDL ops (for both WMMA and MFMA). If so, the lowering from the GPU dialect to ROCDL would be the single place where all the information/logic resides. If it is not possible to represent all of the WMMA and MFMA functionality in the GPU dialect ops, then we can refactor this patch into two stages. Another point is that there is no intrinsic for the load and store ops. Would we even want amdgpu counterparts for them, or should we keep the GPU dialect ops and lower them to ROCDL/LLVM transparently?

On a separate note, AFAIK there is currently no path to lower MFMA ops from the GPU dialect to the amdgpu dialect. Do we plan to add that support, or are users supposed to emit amdgpu dialect ops directly?

krzysz00 added a subscriber: sjw36. Aug 14 2023, 7:18 AM

The reason I want to lower the matrix ops from GPU to AMDGPU + memref/vector + ... instead of to ROCDL and LLVM directly is separation of concerns and progressive lowering.

gpu-to-amdgpu lowers GPU matrix operations to MLIR constructs that abstract away some of the fiddlier details of the underlying intrinsics (such as how F16 * F16 => F16 doesn't return a vector of F16 (with half its entries updated) but instead a vector of 32-bit values with the low or high halves updated).

Going straight from GPU to ROCDL means you have to copy all the complexity that's present in amdgpu-to-rocdl, and I don't want to maintain two copies of the same code if I don't have to, hence my request to do a two-stage lowering.

And re load and store: unlike on Nvidia, WMMA is just an instruction; loads and stores should be lowered to operations on a private memory attribution (aka an alloca()) that replaces each GPU matrix value.
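
As a hedged illustration of that load/store point: the function name, the 8-element fragment size, and the indexing below are assumptions for illustration, not the real WMMA lane layout.

```mlir
func.func @lower_c_load(%src: memref<16x16xf32>) {
  // Private buffer standing in for a !gpu.mma_matrix<16x16xf32, "COp"> value.
  %buf = memref.alloca() : memref<8xf32, #gpu.address_space<private>>
  %lane = gpu.lane_id
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %c2 = arith.constant 2 : index
  %c8 = arith.constant 8 : index
  %c16 = arith.constant 16 : index
  %col = arith.remui %lane, %c16 : index
  // Each lane copies its fragment elements with plain loads/stores; no
  // special intrinsic is involved.
  scf.for %i = %c0 to %c8 step %c1 {
    %row = arith.muli %i, %c2 : index
    %v = memref.load %src[%row, %col] : memref<16x16xf32>
    memref.store %v, %buf[%i] : memref<8xf32, #gpu.address_space<private>>
  }
  return
}
```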

As to MFMA, @sjw36 for more there since he's been poking at that lowering

Just wanted to check on what you're planning here

> Just wanted to check on what you're planning here

Hi @krzysz00, I will take all the suggestions and update this patch in a few days.

navdeepkk updated this revision to Diff 557107. Sep 20 2023, 5:55 AM

Address second major comment on the revision. Use gpu.lane_id lowering to get
laneID while converting WMMA ops.

navdeepkk updated this revision to Diff 557276. Sep 24 2023, 1:14 AM
navdeepkk marked 2 inline comments as done.

Add a separate pass to convert gpu.subgroup_mma_compute op to amdgpu.wmma op

navdeepkk updated this revision to Diff 557277. Sep 24 2023, 1:21 AM

Add missing commits

navdeepkk updated this revision to Diff 557278. Sep 24 2023, 1:26 AM

Format comment in LowerGPUOpsToAMDGPUOps.cpp

I think this is headed in a good direction; I just have some thoughts, mostly about how the pass options were set up.

mlir/include/mlir/Conversion/GPUToROCDL/GPUToROCDLPass.h
72

I don't want opSelect here, though I'm willing to allow warpSize so long as that becomes a module attribute that SerializeToHsaco picks up on for correct device library linking/option setting.

Anything that's requiring reasoning about opSelect should be moved up to gpu-to-amdgpu.

Heck, you probably can and should implement the case where you use a sequence of two WMMA ops, one writing low halves and one writing high halves, to do a larger matrix multiplication that produces fp16 values.
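
To sketch what that could look like, assuming the opsel bit is exposed on amdgpu.wmma as a subwordOffset-style attribute (the attribute name is from memory):

```mlir
// Two f16 WMMAs packed into the same accumulator registers: the first
// writes the low halves, the second writes the high halves.
func.func @paired_f16_wmma(%a0: vector<16xf16>, %b0: vector<16xf16>,
                           %a1: vector<16xf16>, %b1: vector<16xf16>,
                           %acc: vector<16xf16>) -> vector<16xf16> {
  %lo = amdgpu.wmma %a0 * %b0 + %acc
      : vector<16xf16>, vector<16xf16>, vector<16xf16>
  %hi = amdgpu.wmma %a1 * %b1 + %lo {subwordOffset = 1 : i32}
      : vector<16xf16>, vector<16xf16>, vector<16xf16>
  return %hi : vector<16xf16>
}
```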

mlir/include/mlir/Conversion/Passes.td
432

I'm not convinced this needs to be here.

You could implement a larger subgroup MMA operation that uses both halves of the output registers.

Alternatively, you could make amdgpu.opSelect an attribute on the relevant operations instead of making it a pass-wide knob.

518

This does make some sense to have here, yeah

mlir/include/mlir/Dialect/LLVMIR/ROCDLOps.td
221

Not sure we'll need these?

mlir/lib/Conversion/GPUToAMDGPU/WmmaOpsToAMDGPU.cpp
85

(To clarify, this could be two compute ops, one with opSelect = false and one with opSelect = true)

165

Do we need this here?

(also, if you'll have this here, you'll want the data layout incantations somewhere in the pass)

mlir/lib/Conversion/GPUToROCDL/WmmaOpsToROCDL.cpp
142

I'm not sure why these are *LaneFirst functions. Could you expand on that?

257

Grammar nit here, perhaps "only size 32 wavefronts are supported"

425

That's too strong a restriction - WMMA is available on all the gfx11NN GPUs

487

There's something suspicious about this being here

mlir/test/CMakeLists.txt
38

Yeah, let's loosen this up and have it check for prefix gfx11 or something like that.
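
Something along these lines, perhaps. This is only a sketch; the variable names are assumptions, not the actual contents of mlir/test/CMakeLists.txt:

```cmake
# Accept any gfx11-series chipset rather than requiring an exact name.
if(MLIR_ENABLE_ROCM_RUNNER AND ROCM_TEST_CHIPSET MATCHES "^gfx11")
  set(MLIR_RUN_WMMA_TESTS ON)
endif()
```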

42

This should be a user-configurable option?

navdeepkk updated this revision to Diff 557722. Oct 17 2023, 1:49 AM
navdeepkk marked 11 inline comments as done.

Address comments and implement upper/lower half FP16 WMMA

@krzysz00 Please review.

mlir/include/mlir/Conversion/Passes.td
432

Thanks, I wasn't aware of this use case. I was under the impression that it was the job of the underlying register allocator to utilize the upper and lower halves of the registers properly. But if that isn't the case, then it makes sense to have this as an op attribute and generate code with the opSelect attribute both set and unset.

mlir/include/mlir/Dialect/LLVMIR/ROCDLOps.td
221

These are actually needed currently to check some conditions in the lowerings.

mlir/lib/Conversion/GPUToAMDGPU/WmmaOpsToAMDGPU.cpp
165

This can be dropped actually. Thanks.

mlir/lib/Conversion/GPUToROCDL/WmmaOpsToROCDL.cpp
142

It's just a naming convention I used, as the laneId is the innermost position of the indexing calculation. I can rename it if it's too confusing.

425

Are the thread data mappings the same for them too? Can you please point me to a reference? This patch has only been tested on gfx1100; it can be extended and tested later.

mlir/test/CMakeLists.txt
38

I have relaxed the checks in the passes. I think this variable has to have the complete name. If gfx1100 is the minimum series that supports WMMA, then I think this will not be problematic. Code generated for gfx1100 will work on any gfx11-series card, right?