This is an archive of the discontinued LLVM Phabricator instance.


Differential D127244

[mlir][AMDGPU] Add `mfma` operation to wrap mfma intrinsics.
Abandoned · Public

Authored by krzysz00 on Jun 7 2022, 12:30 PM.

Details

Reviewers

ftynse
ThomasRaoux
herhut

Summary

The mfma (matrix fused multiply add) instructions present on some
AMDGPUs provide hardware support for particular matrix multiplication
sizes and formats.

In LLVM, these operations are exposed via intrinsics. In order to make
their usage in MLIR more ergonomic, we define an amdgpu.mfma
operation that takes an MFMAInstr enum to specify which instruction
should be used. This allows higher-level code to select the mfma
operation to be used by changing an enum value instead of by selecting
a different operation, improving the ergonomics of generating matrix
multiplication kernels.

The amdgpu.mfma operation also lets operations that logically take
vectors of bytes accept those vectors directly, instead of requiring,
as LLVM does, that the inputs be concatenated into an i32 or i64.
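
To make the byte concatenation concrete: the op description in AMDGPU.td in this diff states that the bytes are concatenated in little-endian order (v[0] goes to bits [7:0], v[1] to bits [15:8], and so on). The following is a minimal standalone C++ sketch of that packing order. The helper name `packBytesLittleEndian` is an assumption made for illustration; it is not part of this revision or of MLIR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not from the revision): pack `n` bytes (n <= 8) into
// one integer in little-endian order, so bytes[0] lands in bits [7:0],
// bytes[1] in bits [15:8], and so on.
inline uint64_t packBytesLittleEndian(const uint8_t *bytes, size_t n) {
  assert(n <= 8 && "result must fit in 64 bits");
  uint64_t result = 0;
  for (size_t i = 0; i < n; ++i)
    result |= static_cast<uint64_t>(bytes[i]) << (8 * i);
  return result;
}
```

With four input bytes the result fits in an i32, and with eight it fills an i64, which matches the two integer widths mentioned in the summary.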

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

krzysz00 created this revision. Jun 7 2022, 12:30 PM

Herald added a reviewer: ftynse. Jun 7 2022, 12:30 PM

Herald added a reviewer: ThomasRaoux.

Herald added a project: Restricted Project.

Herald added subscribers: bzcheeseman, kosarev, jsilvanus and 33 others.

krzysz00 requested review of this revision. Jun 7 2022, 12:30 PM

Herald added a reviewer: herhut. Jun 7 2022, 12:30 PM

Herald added a project: Restricted Project.

Herald added subscribers: stephenneuendorffer, nicolasvasilache, wdng.

Harbormaster completed remote builds in B168377: Diff 434915. Jun 7 2022, 12:49 PM

@ThomasRaoux any comments?

stephenneuendorffer added inline comments. Jun 10 2022, 11:46 AM

mlir/include/mlir/Dialect/AMDGPU/AMDGPU.td
173–190	This seems awkward. Do you really need different ops for all of these, rather than having a single op that considers the types of its arguments, or perhaps takes some small number of parameters? You may need to generate something that says which ops are valid on a particular architecture, but that seems preferable to me.

krzysz00 added inline comments. Jun 10 2022, 12:05 PM

mlir/include/mlir/Dialect/AMDGPU/AMDGPU.td
173–190	MLIR-side, this is one op? And we do need to case, because (to selectively quote the intrinsics list)
def int_amdgcn_mfma_f32_16x16x1f32 : AMDGPUMfmaIntrinsic<llvm_v16f32_ty, llvm_float_ty>;
def int_amdgcn_mfma_f32_32x32x2f32 : AMDGPUMfmaIntrinsic<llvm_v16f32_ty, llvm_float_ty>;
These two instructions take the same types of argument but have different semantics. They both take [64 simd]xf32 inputs for A and B and return a [64]x16xf32 output, but there's more than one option for how to reshape [64 things] * [64 things] -> [256 things] as a matrix multiply. One option is that you're doing [32 x 2] * [2 x 32] -> [32 x 32], but another is [64 x 1] * [1 x 16] -> [64 x 16], though what the instruction name is saying is that we have 4x[16x1] * [1x16] -> 4x[16x16] (or its transpose if N is the long dimension).
So, no, we can't dispatch off type alone.
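
The ambiguity described in this comment can be checked with a little arithmetic. The sketch below is an illustration only: the `MfmaShape` struct, its field names, and the 64-lane wavefront constant are assumptions made for the example (the block count of 4 follows the 4x[16x1] * [1x16] reading given in the comment), not definitions taken from LLVM or MLIR.

```cpp
#include <cstdint>

// Hypothetical description of one MFMA variant (illustrative names only).
struct MfmaShape {
  int64_t m, n, k; // per-block matrix dimensions: [m x k] * [k x n]
  int64_t blocks;  // number of independent blocks computed at once
};

// Assumed number of lanes in a wavefront, matching the "[64 simd]" wording.
constexpr int64_t kWaveSize = 64;

// Elements each lane holds for A, B, and the accumulator/result.
constexpr int64_t aElemsPerLane(MfmaShape s) {
  return s.blocks * s.m * s.k / kWaveSize;
}
constexpr int64_t bElemsPerLane(MfmaShape s) {
  return s.blocks * s.k * s.n / kWaveSize;
}
constexpr int64_t outElemsPerLane(MfmaShape s) {
  return s.blocks * s.m * s.n / kWaveSize;
}

// True when two variants are indistinguishable by per-lane operand sizes.
constexpr bool sameFootprint(MfmaShape x, MfmaShape y) {
  return aElemsPerLane(x) == aElemsPerLane(y) &&
         bElemsPerLane(x) == bElemsPerLane(y) &&
         outElemsPerLane(x) == outElemsPerLane(y);
}

constexpr MfmaShape k32x32x2f32{32, 32, 2, 1}; // one 32x32 block, k = 2
constexpr MfmaShape k16x16x1f32{16, 16, 1, 4}; // four 16x16 blocks, k = 1

// Both variants take one f32 per lane for A and B and produce 16 per lane,
// so the operand types alone cannot tell them apart.
static_assert(sameFootprint(k32x32x2f32, k16x16x1f32),
              "different shapes, identical operand footprint");
```

Under these assumptions both shapes reduce to "f32 in, 16 x f32 out" per lane, which is why some selector beyond the operand types is needed.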

ThomasRaoux added inline comments. Jun 12 2022, 8:35 PM

mlir/include/mlir/Dialect/AMDGPU/AMDGPU.td
173	Just a few suggestions: Could the type be inferred from the operand type instead of being part of the enum? I wonder if having the dimensions be integer attributes would make the code more generic? I assume that decoupling this slightly from the rocdl intrinsics may be beneficial.

krzysz00 added subscribers: jerryyin, whchung. Jun 14 2022, 8:17 AM

krzysz00 added inline comments.

mlir/include/mlir/Dialect/AMDGPU/AMDGPU.td
173	Pulling in @jerryyin and @whchung for their thoughts, but, from where I'm standing, this is at least partially meant to be a wrapper around the intrinsics that lets us have things like the constants being specified as attributes on the op instead of additional arguments.

jerryyin added inline comments. Jun 14 2022, 8:53 AM

mlir/include/mlir/Dialect/AMDGPU/AMDGPU.td
173	I think @ThomasRaoux brings up a valid point: we can indeed bring more structure to this enum, which can provide more information to our side of the `xdlopsSelect.h` in `miopen-dialect`. Each field carries a meaning in the xdlops instruction. Take `f32_32x32x1f32` as an example:
- The first f32 is the return type.
- 32x32 is the size of the dimensions for the A and B matrices.
- 1 is the number of gemms we performed; if the number is larger than 1, then it is a reduction (with sum).
- The last f32 is the argument type.
Desirably, this can be constructed from a number of attribute fields that come inherently with the instruction naming.
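
The field breakdown in this comment can be sketched as a small parser over the enum case names. This is a standalone illustration, not code from the revision or from `miopen-dialect`: `MfmaFields` and `parseMfmaName` are hypothetical names, and the parser assumes the `<result>_<M>x<N>x<K><arg>` layout with an optional underscore before the argument type (as in `i32_16x16x32_i8`). Any trailing suffix such as `_1k` is simply kept as part of the argument-type string.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Fields read off an MFMA mnemonic, following the breakdown in the comment
// above. The struct and its field names are hypothetical.
struct MfmaFields {
  std::string resultType; // leading type, e.g. "f32"
  int m = 0;              // first dimension in the name
  int n = 0;              // second dimension in the name
  int k = 0;              // third number in the name
  std::string argType;    // trailing type, e.g. "f32" or "i8"
};

// Parses a run of decimal digits starting at `pos` and advances `pos`.
static bool parseNumber(const std::string &s, std::size_t &pos, int &out) {
  assert(pos <= s.size());
  std::size_t start = pos;
  int value = 0;
  while (pos < s.size() && std::isdigit(static_cast<unsigned char>(s[pos]))) {
    value = value * 10 + (s[pos] - '0');
    ++pos;
  }
  if (pos == start)
    return false;
  out = value;
  return true;
}

// Consumes the literal 'x' that separates the dimensions.
static bool expectX(const std::string &s, std::size_t &pos) {
  if (pos >= s.size() || s[pos] != 'x')
    return false;
  ++pos;
  return true;
}

// Returns true and fills `fields` if `name` matches the assumed layout.
bool parseMfmaName(const std::string &name, MfmaFields &fields) {
  std::size_t underscore = name.find('_');
  if (underscore == std::string::npos || underscore == 0)
    return false;
  MfmaFields parsed;
  parsed.resultType = name.substr(0, underscore);
  std::size_t pos = underscore + 1;
  if (!parseNumber(name, pos, parsed.m) || !expectX(name, pos) ||
      !parseNumber(name, pos, parsed.n) || !expectX(name, pos) ||
      !parseNumber(name, pos, parsed.k))
    return false;
  // Newer names put an underscore before the argument type.
  if (pos < name.size() && name[pos] == '_')
    ++pos;
  if (pos >= name.size())
    return false;
  parsed.argType = name.substr(pos);
  fields = parsed;
  return true;
}
```

For `f32_32x32x1f32` this yields the result type `f32`, dimensions 32 and 32, the number 1, and the argument type `f32`, which are the four pieces listed in the comment.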

Per feedback here, I'm abandoning this revision in favor of:

- A new revision that adds the new ROCDL intrinsics from LLVM but doesn't touch AMDGPU.
- Going back downstream to design a better mfma operation that looks something like mfma {k = K, m = M, n = N, ...} %c = %a * %b.
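
As a rough sketch of what selecting an instruction from structured attributes could look like, the snippet below builds an enum-style mnemonic from separate m, n, k, and type fields. Everything here is hypothetical: `MfmaConfig` and `legacyMnemonic` are names invented for the example, the redesigned op did not exist in this revision, and the function does not check that the hardware actually provides the resulting instruction.

```cpp
#include <cassert>
#include <string>

// Hypothetical structured description of an mfma, mirroring the
// {k = K, m = M, n = N, ...} attributes proposed above.
struct MfmaConfig {
  int m, n, k;
  std::string srcType; // element type of the A and B operands, e.g. "f32"
  std::string dstType; // element type of the accumulator, e.g. "f32"
};

// Builds a mnemonic in the older "<dst>_<M>x<N>x<K><src>" style used by the
// enum cases in this revision. No validity check is performed.
std::string legacyMnemonic(const MfmaConfig &c) {
  assert(c.m > 0 && c.n > 0 && c.k > 0 && "dimensions must be positive");
  return c.dstType + "_" + std::to_string(c.m) + "x" + std::to_string(c.n) +
         "x" + std::to_string(c.k) + c.srcType;
}
```

The point of the design is that callers vary plain integer and type fields, and only the lowering needs to know how those fields map to a specific intrinsic.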

Revision Contents

Path	Size
mlir/include/mlir/Dialect/AMDGPU/AMDGPU.td	95 lines
mlir/include/mlir/Dialect/AMDGPU/AMDGPUDialect.h	6 lines
mlir/include/mlir/Dialect/AMDGPU/CMakeLists.txt	8 lines
mlir/include/mlir/Dialect/LLVMIR/ROCDLOps.td	34 lines
mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp	115 lines
mlir/lib/Dialect/AMDGPU/IR/AMDGPUDialect.cpp	150 lines
mlir/lib/Dialect/AMDGPU/IR/CMakeLists.txt	2 lines
mlir/test/Conversion/AMDGPUToROCDL/amdgpu-to-rocdl.mlir	73 lines
mlir/test/Dialect/AMDGPU/ops.mlir	73 lines
mlir/test/Dialect/LLVMIR/rocdl.mlir	76 lines

Diff 434915

mlir/include/mlir/Dialect/AMDGPU/AMDGPU.td

//===-- AMDGPU.td - AMDGPU dialect definitions - tablegen -------===//		//===-- AMDGPU.td - AMDGPU dialect definitions - tablegen -------===//
//		//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.		// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.		// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception		// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#ifndef AMDGPU		#ifndef AMDGPU
#define AMDGPU		#define AMDGPU

include "mlir/Interfaces/SideEffectInterfaces.td"		include "mlir/Interfaces/SideEffectInterfaces.td"
		include "mlir/IR/EnumAttr.td"
include "mlir/IR/OpBase.td"		include "mlir/IR/OpBase.td"

def AMDGPU_Dialect : Dialect {		def AMDGPU_Dialect : Dialect {
let name = "amdgpu";		let name = "amdgpu";
let cppNamespace = "::mlir::amdgpu";		let cppNamespace = "::mlir::amdgpu";
let description = [{		let description = [{
The `AMDGPU` dialect provides wrappers around AMD-specific functionality		The `AMDGPU` dialect provides wrappers around AMD-specific functionality
and LLVM intrinsics. These wrappers should be used in conjunction with		and LLVM intrinsics. These wrappers should be used in conjunction with
[... 137 unchanged lines hidden ...]	def AMDGPU_RawBufferAtomicFaddOp :
let assemblyFormat = [{		let assemblyFormat = [{
attr-dict $value `->` $memref `[` $indices `]`		attr-dict $value `->` $memref `[` $indices `]`
(`sgprOffset` $sgprOffset^)? `:`		(`sgprOffset` $sgprOffset^)? `:`
type($value) `->` type($memref) `,` type($indices)		type($value) `->` type($memref) `,` type($indices)
}];		}];
let hasVerifier = 1;		let hasVerifier = 1;
}		}

		// Available MFMA intrinsics.
		// Keep up to date with llvm/include/llvm/IR/IntrinsicsAMDGPU.td
		// Generated by: perl -ne 'BEGIN { $i = 0; } if (/amdgcn_mfma_(\w+)\s:\sAMDGPUMfmaIntrinsic/) { print "I32EnumAttrCase<\"$1\", $i>,\n"; $i += 1; }' l
		def AMDGPU_MFMAInstr : I32EnumAttr<"MFMAInstr",
		"Any of the possible MFMA instructions available on AMD GPUs.",
		[
		I32EnumAttrCase<"f32_32x32x1f32", 0>,
		I32EnumAttrCase<"f32_16x16x1f32", 1>,
		I32EnumAttrCase<"f32_4x4x1f32", 2>,
		I32EnumAttrCase<"f32_32x32x2f32", 3>,
		I32EnumAttrCase<"f32_16x16x4f32", 4>,
		I32EnumAttrCase<"f32_32x32x4f16", 5>,
		I32EnumAttrCase<"f32_16x16x4f16", 6>,
		I32EnumAttrCase<"f32_4x4x4f16", 7>,
		I32EnumAttrCase<"f32_32x32x8f16", 8>,
		I32EnumAttrCase<"f32_16x16x16f16", 9>,
		I32EnumAttrCase<"i32_32x32x4i8", 10>,
		I32EnumAttrCase<"i32_16x16x4i8", 11>,
		I32EnumAttrCase<"i32_4x4x4i8", 12>,
		I32EnumAttrCase<"i32_32x32x8i8", 13>,
		I32EnumAttrCase<"i32_16x16x16i8", 14>,
		I32EnumAttrCase<"f32_32x32x2bf16", 15>,
		I32EnumAttrCase<"f32_16x16x2bf16", 16>,
		I32EnumAttrCase<"f32_4x4x2bf16", 17>,
		I32EnumAttrCase<"f32_32x32x4bf16", 18>,
		I32EnumAttrCase<"f32_16x16x8bf16", 19>,
		I32EnumAttrCase<"f32_32x32x4bf16_1k", 20>,
		I32EnumAttrCase<"f32_16x16x4bf16_1k", 21>,
		I32EnumAttrCase<"f32_4x4x4bf16_1k", 22>,
		I32EnumAttrCase<"f32_32x32x8bf16_1k", 23>,
		I32EnumAttrCase<"f32_16x16x16bf16_1k", 24>,
		I32EnumAttrCase<"f64_16x16x4f64", 25>,
		I32EnumAttrCase<"f64_4x4x4f64", 26>,
		I32EnumAttrCase<"i32_16x16x32_i8", 27>,
		I32EnumAttrCase<"i32_32x32x16_i8", 28>,
		I32EnumAttrCase<"f32_16x16x8_xf32", 29>,
		I32EnumAttrCase<"f32_32x32x4_xf32", 30>
		]> {
		let genSpecializedAttr = 0;
		let cppNamespace = "::mlir::amdgpu";
		}

		def AMDGPU_MFMAInstrAttr : EnumAttr<AMDGPU_Dialect, AMDGPU_MFMAInstr,
		"mfma_instr">;

		// mfma
		def MFMAInTypes : AnyTypeOf<[F32, F64, I32, I64,
		VectorOfLengthAndType<[2], [F32]>,
		VectorOfLengthAndType<[4], [F16]>,
		VectorOfLengthAndType<[2, 4], [BF16]>,
		VectorOfLengthAndType<[4, 8], [I8]>]>;
		def MFMAOutTypes : AnyTypeOf<[F64,
		VectorOfLengthAndType<[4, 16, 32], [F32]>,
		VectorOfLengthAndType<[4, 16, 32], [I32]>,
		VectorOfLengthAndType<[4], [F64]>]>;

		def AMDGPU_MFMAOp :
		AMDGPU_Op<"mfma", [AllTypesMatch<["sourceA", "sourceB"]>,
		AllTypesMatch<["destC", "destD"]>]>,
		Arguments<(ins AMDGPU_MFMAInstrAttr:$instr,
		MFMAInTypes:$sourceA,
		MFMAInTypes:$sourceB,
		MFMAOutTypes:$destC,
		I32Attr:$cbsz,
		I32Attr:$abid,
		I32Attr:$blgp)>,
		Results<(outs MFMAOutTypes: $destD)> {
		let summary = "MLIR wrapper for CDNA mfma instructions";
		let description = [{
		The `amdgpu.mfma` op is an MLIR wrapper around intrinsics
		for various `mfma` instructions in the CDNA architecture, which perform
		multiple outer products in order to allow fast matrix multiplication.

		The `instr` enum specifies the mfma instruction to be used, while the `cbsz`,
		`abid`, and `blgp` attributes supply the immediate arguments to said operation.

		Note, this wrapper allows specifying `vector<4Kxi8>` arguments to MFMA
		intrinsics that take an integer type of width `4K`. For example,
		one can provide a vector<4xi8> as an argument to an MFMA instruction that
		logically takes 4 i8s but whose intrinsics are specified to take an i32.
		In these cases, the bytes in the vector will be concatenated in little-endian
		order (that is, v[0] will go to arg[7:0], v[1] to arg[15:8] and so on).

		The `cbsz`, `abid`, and `blgp` attributes control broadcast and swizzling
		during the computation.
		}];
		let assemblyFormat = [{
		$instr attr-dict $sourceA `*` $sourceB `+` $destC
		`cbsz` `=` $cbsz `abid` `=` $abid `blgp` `=` $blgp
		`:` type($sourceA) `,` type($destC)
		}];
		let hasVerifier = 1;
		}

#endif // AMDGPU		#endif // AMDGPU

mlir/include/mlir/Dialect/AMDGPU/AMDGPUDialect.h

	//===- AMDGPUDialect.h - MLIR Dialect for AMDGPU ---------- C++ --===//			//===- AMDGPUDialect.h - MLIR Dialect for AMDGPU ---------- C++ --===//
	//			//
	// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.			// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
	// See https://llvm.org/LICENSE.txt for license information.			// See https://llvm.org/LICENSE.txt for license information.
	// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception			// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	//			//
	// This file declares a dialect for MLIR wrappers around AMDGPU-specific			// This file declares a dialect for MLIR wrappers around AMDGPU-specific
	// intrinsics and for other AMD GPU-specific functionality.			// intrinsics and for other AMD GPU-specific functionality.
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	#ifndef MLIR_DIALECT_AMDGPU_AMDGPUDIALECT_H_			#ifndef MLIR_DIALECT_AMDGPU_AMDGPUDIALECT_H_
	#define MLIR_DIALECT_AMDGPU_AMDGPUDIALECT_H_			#define MLIR_DIALECT_AMDGPU_AMDGPUDIALECT_H_

	#include "mlir/IR/BuiltinTypes.h"
	#include "mlir/IR/Dialect.h"			#include "mlir/IR/Dialect.h"
	#include "mlir/IR/OpDefinition.h"			#include "mlir/IR/OpDefinition.h"
	#include "mlir/Interfaces/SideEffectInterfaces.h"			#include "mlir/Interfaces/SideEffectInterfaces.h"

	#include "mlir/Dialect/AMDGPU/AMDGPUDialect.h.inc"			#include "mlir/Dialect/AMDGPU/AMDGPUDialect.h.inc"

				#include "mlir/Dialect/AMDGPU/AMDGPUEnums.h.inc"

				#define GET_ATTRDEF_CLASSES
				#include "mlir/Dialect/AMDGPU/AMDGPUAttributes.h.inc"

	#define GET_OP_CLASSES			#define GET_OP_CLASSES
	#include "mlir/Dialect/AMDGPU/AMDGPU.h.inc"			#include "mlir/Dialect/AMDGPU/AMDGPU.h.inc"

	#endif // MLIR_DIALECT_AMDGPU_AMDGPUDIALECT_H_			#endif // MLIR_DIALECT_AMDGPU_AMDGPUDIALECT_H_

mlir/include/mlir/Dialect/AMDGPU/CMakeLists.txt

	add_mlir_dialect(AMDGPU amdgpu)			add_mlir_dialect(AMDGPU amdgpu)
	add_mlir_doc(AMDGPU AMDGPU Dialects/ -gen-dialect-doc)			add_mlir_doc(AMDGPU AMDGPU Dialects/ -gen-dialect-doc)

	set(LLVM_TARGET_DEFINITIONS AMDGPU.td)			set(LLVM_TARGET_DEFINITIONS AMDGPU.td)
				mlir_tablegen(AMDGPUEnums.h.inc -gen-enum-decls)
				mlir_tablegen(AMDGPUEnums.cpp.inc -gen-enum-defs)
				add_public_tablegen_target(MLIRAMDGPUEnumsGen)

				set(LLVM_TARGET_DEFINITIONS AMDGPU.td)
				mlir_tablegen(AMDGPUAttributes.h.inc -gen-attrdef-decls -attrdefs-dialect=amdgpu)
				mlir_tablegen(AMDGPUAttributes.cpp.inc -gen-attrdef-defs -attrdefs-dialect=amdgpu)
				add_public_tablegen_target(MLIRAMDGPUAttributesIncGen)

mlir/include/mlir/Dialect/LLVMIR/ROCDLOps.td

[... 116 unchanged lines hidden ...]	class ROCDL_Mfma_IntrOp<string mnemonic, list<Trait> traits = []> :
LLVM_IntrOpBase<ROCDL_Dialect, mnemonic,		LLVM_IntrOpBase<ROCDL_Dialect, mnemonic,
"amdgcn_" # !subst(".","_", mnemonic),		"amdgcn_" # !subst(".","_", mnemonic),
[], [], traits, 1>,		[], [], traits, 1>,
Arguments<(ins Variadic<LLVM_Type>:$args)> {		Arguments<(ins Variadic<LLVM_Type>:$args)> {
let assemblyFormat =		let assemblyFormat =
"$args attr-dict `:` functional-type($args, $res)";		"$args attr-dict `:` functional-type($args, $res)";
}		}

		// Available on all CDNA.
def ROCDL_mfma_f32_32x32x1f32 : ROCDL_Mfma_IntrOp<"mfma.f32.32x32x1f32">;		def ROCDL_mfma_f32_32x32x1f32 : ROCDL_Mfma_IntrOp<"mfma.f32.32x32x1f32">;
		def ROCDL_mfma_f32_16x16x1f32 : ROCDL_Mfma_IntrOp<"mfma.f32.16x16x1f32">;
		def ROCDL_mfma_f32_4x4x1f32 : ROCDL_Mfma_IntrOp<"mfma.f32.4x4x1f32">;
def ROCDL_mfma_f32_32x32x2f32 : ROCDL_Mfma_IntrOp<"mfma.f32.32x32x2f32">;		def ROCDL_mfma_f32_32x32x2f32 : ROCDL_Mfma_IntrOp<"mfma.f32.32x32x2f32">;
def ROCDL_mfma_f32_16x16x4f32 : ROCDL_Mfma_IntrOp<"mfma.f32.16x16x4f32">;		def ROCDL_mfma_f32_16x16x4f32 : ROCDL_Mfma_IntrOp<"mfma.f32.16x16x4f32">;
def ROCDL_mfma_f32_16x16x1f32 : ROCDL_Mfma_IntrOp<"mfma.f32.16x16x1f32">;
def ROCDL_mfma_f32_32x32x4f16 : ROCDL_Mfma_IntrOp<"mfma.f32.32x32x4f16">;		def ROCDL_mfma_f32_32x32x4f16 : ROCDL_Mfma_IntrOp<"mfma.f32.32x32x4f16">;
def ROCDL_mfma_f32_32x32x8f16 : ROCDL_Mfma_IntrOp<"mfma.f32.32x32x8f16">;
def ROCDL_mfma_f32_16x16x4f16 : ROCDL_Mfma_IntrOp<"mfma.f32.16x16x4f16">;		def ROCDL_mfma_f32_16x16x4f16 : ROCDL_Mfma_IntrOp<"mfma.f32.16x16x4f16">;
def ROCDL_mfma_f32_16x16x16f16 : ROCDL_Mfma_IntrOp<"mfma.f32.16x16x16f16">;
def ROCDL_mfma_f32_32x32x2bf16 : ROCDL_Mfma_IntrOp<"mfma.f32.32x32x2bf16">;
def ROCDL_mfma_f32_32x32x4bf16 : ROCDL_Mfma_IntrOp<"mfma.f32.32x32x4bf16">;
def ROCDL_mfma_f32_16x16x8bf16 : ROCDL_Mfma_IntrOp<"mfma.f32.16x16x8bf16">;
def ROCDL_mfma_f32_16x16x2bf16 : ROCDL_Mfma_IntrOp<"mfma.f32.16x16x2bf16">;
def ROCDL_mfma_f32_4x4x2bf16 : ROCDL_Mfma_IntrOp<"mfma.f32.4x4x2bf16">;
def ROCDL_mfma_f32_4x4x1f32 : ROCDL_Mfma_IntrOp<"mfma.f32.4x4x1f32">;
def ROCDL_mfma_f32_4x4x4f16 : ROCDL_Mfma_IntrOp<"mfma.f32.4x4x4f16">;		def ROCDL_mfma_f32_4x4x4f16 : ROCDL_Mfma_IntrOp<"mfma.f32.4x4x4f16">;
		def ROCDL_mfma_f32_32x32x8f16 : ROCDL_Mfma_IntrOp<"mfma.f32.32x32x8f16">;
		def ROCDL_mfma_f32_16x16x16f16 : ROCDL_Mfma_IntrOp<"mfma.f32.16x16x16f16">;
def ROCDL_mfma_i32_32x32x4i8 : ROCDL_Mfma_IntrOp<"mfma.i32.32x32x4i8">;		def ROCDL_mfma_i32_32x32x4i8 : ROCDL_Mfma_IntrOp<"mfma.i32.32x32x4i8">;
def ROCDL_mfma_i32_16x16x4i8 : ROCDL_Mfma_IntrOp<"mfma.i32.16x16x4i8">;		def ROCDL_mfma_i32_16x16x4i8 : ROCDL_Mfma_IntrOp<"mfma.i32.16x16x4i8">;
def ROCDL_mfma_i32_4x4x4i8 : ROCDL_Mfma_IntrOp<"mfma.i32.4x4x4i8">;		def ROCDL_mfma_i32_4x4x4i8 : ROCDL_Mfma_IntrOp<"mfma.i32.4x4x4i8">;
def ROCDL_mfma_i32_32x32x8i8 : ROCDL_Mfma_IntrOp<"mfma.i32.32x32x8i8">;		def ROCDL_mfma_i32_32x32x8i8 : ROCDL_Mfma_IntrOp<"mfma.i32.32x32x8i8">;
def ROCDL_mfma_i32_16x16x16i8 : ROCDL_Mfma_IntrOp<"mfma.i32.16x16x16i8">;		def ROCDL_mfma_i32_16x16x16i8 : ROCDL_Mfma_IntrOp<"mfma.i32.16x16x16i8">;
		def ROCDL_mfma_f32_32x32x2bf16 : ROCDL_Mfma_IntrOp<"mfma.f32.32x32x2bf16">;
		def ROCDL_mfma_f32_16x16x2bf16 : ROCDL_Mfma_IntrOp<"mfma.f32.16x16x2bf16">;
		def ROCDL_mfma_f32_4x4x2bf16 : ROCDL_Mfma_IntrOp<"mfma.f32.4x4x2bf16">;
		def ROCDL_mfma_f32_32x32x4bf16 : ROCDL_Mfma_IntrOp<"mfma.f32.32x32x4bf16">;
		def ROCDL_mfma_f32_16x16x8bf16 : ROCDL_Mfma_IntrOp<"mfma.f32.16x16x8bf16">;
		// New in gfx90a.
		def ROCDL_mfma_f32_32x32x4bf16_1k : ROCDL_Mfma_IntrOp<"mfma.f32.32x32x4bf16.1k">;
		def ROCDL_mfma_f32_16x16x4bf16_1k : ROCDL_Mfma_IntrOp<"mfma.f32.16x16x4bf16.1k">;
		def ROCDL_mfma_f32_4x4x4bf16_1k : ROCDL_Mfma_IntrOp<"mfma.f32.4x4x4bf16.1k">;
		def ROCDL_mfma_f32_32x32x8bf16_1k : ROCDL_Mfma_IntrOp<"mfma.f32.32x32x8bf16.1k">;
		def ROCDL_mfma_f32_16x16x16bf16_1k : ROCDL_Mfma_IntrOp<"mfma.f32.16x16x16bf16.1k">;
		// Note: in gfx940, unlike in gfx90a, the f64 xdlops use the "blgp" argument as a
		// NEG bitfield. See IntrinsicsAMDGPU.td for more info.
		def ROCDL_mfma_f64_16x16x4f64 : ROCDL_Mfma_IntrOp<"mfma.f64.16x16x4f64">;
		def ROCDL_mfma_f64_4x4x4f64 : ROCDL_Mfma_IntrOp<"mfma.f64.4x4x4f64">;
		// New in gfx940.
		def ROCDL_mfma_i32_16x16x32_i8 : ROCDL_Mfma_IntrOp<"mfma.i32.16x16x32.i8">;
		def ROCDL_mfma_i32_32x32x16_i8 : ROCDL_Mfma_IntrOp<"mfma.i32.32x32x16.i8">;
		def ROCDL_mfma_f32_16x16x8_xf32 : ROCDL_Mfma_IntrOp<"mfma.f32.16x16x8.xf32">;
		def ROCDL_mfma_f32_32x32x4_xf32 : ROCDL_Mfma_IntrOp<"mfma.f32.32x32x4.xf32">;

//===---------------------------------------------------------------------===//		//===---------------------------------------------------------------------===//
// Vector buffer load/store intrinsics		// Vector buffer load/store intrinsics

def ROCDL_MubufLoadOp :		def ROCDL_MubufLoadOp :
ROCDL_Op<"buffer.load">,		ROCDL_Op<"buffer.load">,
Results<(outs LLVM_Type:$res)>,		Results<(outs LLVM_Type:$res)>,
Arguments<(ins LLVM_Type:$rsrc,		Arguments<(ins LLVM_Type:$rsrc,
[... 83 unchanged lines hidden ...]

mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp

//===- AMDGPUToROCDL.cpp - AMDGPU to ROCDL dialect conversion -------===//		//===- AMDGPUToROCDL.cpp - AMDGPU to ROCDL dialect conversion -------===//
//		//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.		// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.		// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception		// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#include "mlir/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.h"		#include "mlir/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.h"
#include "../PassDetail.h"		#include "../PassDetail.h"
#include "mlir/Conversion/LLVMCommon/ConversionTarget.h"		#include "mlir/Conversion/LLVMCommon/ConversionTarget.h"
#include "mlir/Conversion/LLVMCommon/Pattern.h"		#include "mlir/Conversion/LLVMCommon/Pattern.h"
#include "mlir/Dialect/AMDGPU/AMDGPUDialect.h"		#include "mlir/Dialect/AMDGPU/AMDGPUDialect.h"
		#include "mlir/Dialect/LLVMIR/LLVMDialect.h"
#include "mlir/Dialect/LLVMIR/ROCDLDialect.h"		#include "mlir/Dialect/LLVMIR/ROCDLDialect.h"
		#include "llvm/ADT/STLExtras.h"

using namespace mlir;		using namespace mlir;
		using namespace mlir::amdgpu;

static Value createI32Constant(ConversionPatternRewriter &rewriter,		static Value createI32Constant(ConversionPatternRewriter &rewriter,
Location loc, int32_t value) {		Location loc, int32_t value) {
IntegerAttr valAttr = rewriter.getI32IntegerAttr(value);		IntegerAttr valAttr = rewriter.getI32IntegerAttr(value);
Type llvmI32 = rewriter.getI32Type();		Type llvmI32 = rewriter.getI32Type();
return rewriter.create<LLVM::ConstantOp>(loc, llvmI32, valAttr);		return rewriter.createOrFold<LLVM::ConstantOp>(loc, llvmI32, valAttr);
}		}

namespace {		namespace {
/// Define lowering patterns for raw buffer ops		/// Define lowering patterns for raw buffer ops
template <typename GpuOp, typename Intrinsic>		template <typename GpuOp, typename Intrinsic>
struct RawBufferOpLowering : public ConvertOpToLLVMPattern<GpuOp> {		struct RawBufferOpLowering : public ConvertOpToLLVMPattern<GpuOp> {
using ConvertOpToLLVMPattern<GpuOp>::ConvertOpToLLVMPattern;		using ConvertOpToLLVMPattern<GpuOp>::ConvertOpToLLVMPattern;

[... 197 unchanged lines hidden ...]	if (lowered->getNumResults() == 1) {
}		}
rewriter.replaceOp(gpuOp, replacement);		rewriter.replaceOp(gpuOp, replacement);
} else {		} else {
rewriter.eraseOp(gpuOp);		rewriter.eraseOp(gpuOp);
}		}
return success();		return success();
}		}
};		};
		} // end anonymous namespace

		/// If `input` is a vector of bytes, concatenate those bytes in little-endian
		/// order to form a single integer of size 8 * [vector length]. This works
		/// around a wart in the AMDGPU intrinsics, where operations that logically take
		/// vectors of bytes instead take integers. Since we do not want to expose this
		/// implementation detail to MLIR, we correct for it here.
		static Value mfmaConcatIfNeeded(ConversionPatternRewriter &rewriter,
		Location loc, Value input) {
		Type inputType = input.getType();
		if (auto vectorType = inputType.dyn_cast<VectorType>()) {
		if (vectorType.getElementType() != rewriter.getI8Type())
		return input;
		int64_t numBytes = vectorType.getNumElements();
		Type destType = rewriter.getIntegerType(numBytes * 8);
		Value result = rewriter.createOrFold<LLVM::ConstantOp>(
		loc, destType, rewriter.getIntegerAttr(destType, 0));
		for (int64_t i = 0; i < numBytes; ++i) {
		Value idxConst = createI32Constant(rewriter, loc, i);
		Value element =
		rewriter.create<LLVM::ExtractElementOp>(loc, input, idxConst);
		Value extended = rewriter.create<LLVM::ZExtOp>(loc, destType, element);
		Value shiftConst = rewriter.createOrFold<LLVM::ConstantOp>(
		loc, destType, rewriter.getIntegerAttr(destType, i * 8));
		Value shifted = rewriter.create<LLVM::ShlOp>(loc, extended, shiftConst);
		result = rewriter.create<LLVM::OrOp>(loc, result, shifted);
		}
		return result;
		}
		return input;
		}

		/// Return the `rocdl` intrinsic corresponding to a `MFMAInstr` value.
		/// This conversion happens here to allow code up the stack to handle the choice
		/// of mfma by picking between enum variants, which is much more ergonomic than
		/// picking between ops, at the cost of some long switch statements in this
		/// pass.
		static StringRef mfmaInstrToIntrinsicName(MFMAInstr instr) {
		#define LOWERING_CASE(type) \
		case MFMAInstr::type: \
		return ROCDL::mfma_##type::getOperationName();
		switch (instr) {
		LOWERING_CASE(f32_32x32x1f32)
		LOWERING_CASE(f32_16x16x1f32)
		LOWERING_CASE(f32_4x4x1f32)
		LOWERING_CASE(f32_32x32x2f32)
		LOWERING_CASE(f32_16x16x4f32)
		LOWERING_CASE(f32_32x32x4f16)
		LOWERING_CASE(f32_16x16x4f16)
		LOWERING_CASE(f32_4x4x4f16)
		LOWERING_CASE(f32_32x32x8f16)
		LOWERING_CASE(f32_16x16x16f16)
		LOWERING_CASE(i32_32x32x4i8)
		LOWERING_CASE(i32_16x16x4i8)
		LOWERING_CASE(i32_4x4x4i8)
		LOWERING_CASE(i32_32x32x8i8)
		LOWERING_CASE(i32_16x16x16i8)
		LOWERING_CASE(f32_32x32x2bf16)
		LOWERING_CASE(f32_16x16x2bf16)
		LOWERING_CASE(f32_4x4x2bf16)
		LOWERING_CASE(f32_32x32x4bf16)
		LOWERING_CASE(f32_16x16x8bf16)
		LOWERING_CASE(f32_32x32x4bf16_1k)
		LOWERING_CASE(f32_16x16x4bf16_1k)
		LOWERING_CASE(f32_4x4x4bf16_1k)
		LOWERING_CASE(f32_32x32x8bf16_1k)
		LOWERING_CASE(f32_16x16x16bf16_1k)
		LOWERING_CASE(f64_16x16x4f64)
		LOWERING_CASE(f64_4x4x4f64)
		LOWERING_CASE(i32_16x16x32_i8)
		LOWERING_CASE(i32_32x32x16_i8)
		LOWERING_CASE(f32_16x16x8_xf32)
		LOWERING_CASE(f32_32x32x4_xf32)
		}
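		#undef LOWERING_CASE
		// Every enum case returns above; this guards against out-of-range values.
		llvm_unreachable("unhandled MFMAInstr value");
		}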

		namespace {
		struct MFMAOpLowering : public ConvertOpToLLVMPattern<MFMAOp> {
		using ConvertOpToLLVMPattern<MFMAOp>::ConvertOpToLLVMPattern;
		LogicalResult
		matchAndRewrite(MFMAOp op, MFMAOpAdaptor adaptor,
		ConversionPatternRewriter &rewriter) const override {
		Location loc = op.getLoc();
		Type outType = typeConverter->convertType(op.destD().getType());

		OperationState loweredOp(loc, mfmaInstrToIntrinsicName(op.instr()));
		loweredOp.addTypes(outType);
		loweredOp.addOperands({mfmaConcatIfNeeded(rewriter, loc, adaptor.sourceA()),
		mfmaConcatIfNeeded(rewriter, loc, adaptor.sourceB()),
		adaptor.destC(),
		createI32Constant(rewriter, loc, op.cbsz()),
		createI32Constant(rewriter, loc, op.abid()),
		createI32Constant(rewriter, loc, op.blgp())});
		Operation *lowered = rewriter.create(loweredOp);
		rewriter.replaceOp(op, lowered->getResults());
		return success();
		}
		};

struct ConvertAMDGPUToROCDLPass		struct ConvertAMDGPUToROCDLPass
: public ConvertAMDGPUToROCDLBase<ConvertAMDGPUToROCDLPass> {		: public ConvertAMDGPUToROCDLBase<ConvertAMDGPUToROCDLPass> {
ConvertAMDGPUToROCDLPass() = default;		ConvertAMDGPUToROCDLPass() = default;

void runOnOperation() override {		void runOnOperation() override {
RewritePatternSet patterns(&getContext());		RewritePatternSet patterns(&getContext());
LLVMTypeConverter converter(&getContext());		LLVMTypeConverter converter(&getContext());
populateAMDGPUToROCDLConversionPatterns(converter, patterns);		populateAMDGPUToROCDLConversionPatterns(converter, patterns);
LLVMConversionTarget target(getContext());		LLVMConversionTarget target(getContext());
		target.addIllegalDialect<::mlir::amdgpu::AMDGPUDialect>();
target.addLegalDialect<::mlir::LLVM::LLVMDialect>();		target.addLegalDialect<::mlir::LLVM::LLVMDialect>();
target.addLegalDialect<::mlir::ROCDL::ROCDLDialect>();		target.addLegalDialect<::mlir::ROCDL::ROCDLDialect>();
if (failed(applyPartialConversion(getOperation(), target,		if (failed(applyPartialConversion(getOperation(), target,
std::move(patterns))))		std::move(patterns))))
signalPassFailure();		signalPassFailure();
}		}
};		};
} // namespace		} // end anonymous namespace

void mlir::populateAMDGPUToROCDLConversionPatterns(		void mlir::populateAMDGPUToROCDLConversionPatterns(
LLVMTypeConverter &converter, RewritePatternSet &patterns) {		LLVMTypeConverter &converter, RewritePatternSet &patterns) {
patterns.add<		patterns.add<
RawBufferOpLowering<amdgpu::RawBufferLoadOp, ROCDL::RawBufferLoadOp>,		RawBufferOpLowering<RawBufferLoadOp, ROCDL::RawBufferLoadOp>,
RawBufferOpLowering<amdgpu::RawBufferStoreOp, ROCDL::RawBufferStoreOp>,		RawBufferOpLowering<RawBufferStoreOp, ROCDL::RawBufferStoreOp>,
RawBufferOpLowering<amdgpu::RawBufferAtomicFaddOp,		RawBufferOpLowering<RawBufferAtomicFaddOp, ROCDL::RawBufferAtomicFAddOp>,
ROCDL::RawBufferAtomicFAddOp>>(converter);		MFMAOpLowering>(converter);
}		}

std::unique_ptr<Pass> mlir::createConvertAMDGPUToROCDLPass() {		std::unique_ptr<Pass> mlir::createConvertAMDGPUToROCDLPass() {
return std::make_unique<ConvertAMDGPUToROCDLPass>();		return std::make_unique<ConvertAMDGPUToROCDLPass>();
}		}

mlir/lib/Dialect/AMDGPU/IR/AMDGPUDialect.cpp

	//===- AMDGPUDialect.cpp - MLIR AMDGPU dialect implementation --------===//			//===- AMDGPUDialect.cpp - MLIR AMDGPU dialect implementation --------===//
	//			//
	// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.			// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
	// See https://llvm.org/LICENSE.txt for license information.			// See https://llvm.org/LICENSE.txt for license information.
	// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception			// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	//			//
	// This file implements the AMDGPU dialect and its operations.			// This file implements the AMDGPU dialect and its operations.
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	#include "mlir/Dialect/AMDGPU/AMDGPUDialect.h"			#include "mlir/Dialect/AMDGPU/AMDGPUDialect.h"

	#include "mlir/IR/Builders.h"			#include "mlir/IR/Builders.h"
				#include "mlir/IR/DialectImplementation.h"
	#include "mlir/IR/OpImplementation.h"			#include "mlir/IR/OpImplementation.h"
	#include "mlir/IR/TypeUtilities.h"			#include "mlir/IR/TypeUtilities.h"
				#include "llvm/ADT/TypeSwitch.h"

	using namespace mlir;			using namespace mlir;
				using namespace mlir::amdgpu;

	#include "mlir/Dialect/AMDGPU/AMDGPUDialect.cpp.inc"			#include "mlir/Dialect/AMDGPU/AMDGPUDialect.cpp.inc"

	void amdgpu::AMDGPUDialect::initialize() {			void AMDGPUDialect::initialize() {
	addOperations<			addOperations<
	#define GET_OP_LIST			#define GET_OP_LIST
	#include "mlir/Dialect/AMDGPU/AMDGPU.cpp.inc"			#include "mlir/Dialect/AMDGPU/AMDGPU.cpp.inc"
	>();			>();
				addAttributes<
				#define GET_ATTRDEF_LIST
				#include "mlir/Dialect/AMDGPU/AMDGPUAttributes.cpp.inc"
				>();
	}			}

	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	// RawBuffer*Op			// RawBuffer*Op
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	template <typename T>			template <typename T>
	static LogicalResult verifyRawBufferOp(T &op) {			static LogicalResult verifyRawBufferOp(T &op) {
	MemRefType bufferType = op.memref().getType().template cast<MemRefType>();			MemRefType bufferType = op.memref().getType().template cast<MemRefType>();
	if (bufferType.getMemorySpaceAsInt() != 0)			if (bufferType.getMemorySpaceAsInt() != 0)
	return op.emitOpError(			return op.emitOpError(
	"Buffer ops must operate on a memref in global memory");			"Buffer ops must operate on a memref in global memory");
	if (!bufferType.hasRank())			if (!bufferType.hasRank())
	return op.emitOpError(			return op.emitOpError(
	"Cannot meaningfully buffer_store to an unranked memref");			"Cannot meaningfully buffer_store to an unranked memref");
	if (static_cast<int64_t>(op.indices().size()) != bufferType.getRank())			if (static_cast<int64_t>(op.indices().size()) != bufferType.getRank())
	return op.emitOpError("Expected " + Twine(bufferType.getRank()) +			return op.emitOpError("Expected " + Twine(bufferType.getRank()) +
	" indices to memref");			" indices to memref");
	return success();			return success();
	}			}

	LogicalResult amdgpu::RawBufferLoadOp::verify() {			LogicalResult RawBufferLoadOp::verify() { return verifyRawBufferOp(*this); }

				LogicalResult RawBufferStoreOp::verify() { return verifyRawBufferOp(*this); }

				LogicalResult RawBufferAtomicFaddOp::verify() {
	return verifyRawBufferOp(*this);			return verifyRawBufferOp(*this);
	}			}

	LogicalResult amdgpu::RawBufferStoreOp::verify() {			//===----------------------------------------------------------------------===//
	return verifyRawBufferOp(*this);			// MFMAOp
				//===----------------------------------------------------------------------===//
				LogicalResult MFMAOp::verify() {
				Builder b(getOperation());
				StringRef instrName = stringifyMFMAInstr(instr());

				Type inType = sourceA().getType();
				switch (instr()) {
				case MFMAInstr::f32_32x32x1f32:
				case MFMAInstr::f32_16x16x1f32:
				case MFMAInstr::f32_4x4x1f32:
				case MFMAInstr::f32_32x32x2f32:
				case MFMAInstr::f32_16x16x4f32:
				if (inType != b.getF32Type())
				return emitOpError(instrName + " requires f32 inputs");
				break;
				case MFMAInstr::f32_32x32x4f16:
				case MFMAInstr::f32_16x16x4f16:
				case MFMAInstr::f32_4x4x4f16:
				case MFMAInstr::f32_32x32x8f16:
				case MFMAInstr::f32_16x16x16f16:
				if (inType != VectorType::get(4, b.getF16Type()))
				return emitOpError(instrName + " requires vector<4xf16> inputs");
				break;
				case MFMAInstr::i32_32x32x4i8:
				case MFMAInstr::i32_16x16x4i8:
				case MFMAInstr::i32_4x4x4i8:
				case MFMAInstr::i32_32x32x8i8:
				case MFMAInstr::i32_16x16x16i8:
				if (inType != b.getI32Type() && inType != VectorType::get(4, b.getI8Type()))
				return emitOpError(instrName + " requires i32 or vector<4xi8> inputs");
				break;
				case MFMAInstr::f32_32x32x2bf16:
				case MFMAInstr::f32_16x16x2bf16:
				case MFMAInstr::f32_4x4x2bf16:
				case MFMAInstr::f32_32x32x4bf16:
				case MFMAInstr::f32_16x16x8bf16:
				if (inType != VectorType::get(2, b.getBF16Type()))
				return emitOpError(instrName + " requires vector<2xbf16> inputs");
				break;
				case MFMAInstr::f32_32x32x4bf16_1k:
				case MFMAInstr::f32_16x16x4bf16_1k:
				case MFMAInstr::f32_4x4x4bf16_1k:
				case MFMAInstr::f32_32x32x8bf16_1k:
				case MFMAInstr::f32_16x16x16bf16_1k:
				if (inType != VectorType::get(4, b.getBF16Type()))
				return emitOpError(instrName + " requires vector<4xbf16> inputs");
				break;
				case MFMAInstr::f64_16x16x4f64:
				case MFMAInstr::f64_4x4x4f64:
				if (inType != b.getF64Type())
				return emitOpError(instrName + " requires f64 inputs");
				break;
				case MFMAInstr::i32_16x16x32_i8:
				case MFMAInstr::i32_32x32x16_i8:
				if (inType != b.getI64Type() && inType != VectorType::get(8, b.getI8Type()))
				return emitOpError(instrName + " requires i64 or vector<8xi8> inputs");
				break;
				case MFMAInstr::f32_16x16x8_xf32:
				case MFMAInstr::f32_32x32x4_xf32:
				if (inType != VectorType::get(2, b.getF32Type()))
				return emitOpError(instrName + " requires vector<2xf32> inputs");
				break;
	}			}

	LogicalResult amdgpu::RawBufferAtomicFaddOp::verify() {			Type outType = destC().getType();
	return verifyRawBufferOp(*this);			switch (instr()) {
				case MFMAInstr::f32_32x32x1f32:
				case MFMAInstr::f32_32x32x4f16:
				case MFMAInstr::f32_32x32x2bf16:
				case MFMAInstr::f32_32x32x4bf16_1k:
				if (outType != VectorType::get(32, b.getF32Type()))
				return emitOpError(instrName + " must have vector<32xf32> outputs");
				break;
				case MFMAInstr::f32_16x16x1f32:
				case MFMAInstr::f32_32x32x2f32:
				case MFMAInstr::f32_16x16x4f16:
				case MFMAInstr::f32_32x32x8f16:
				case MFMAInstr::f32_16x16x2bf16:
				case MFMAInstr::f32_32x32x4bf16:
				case MFMAInstr::f32_16x16x4bf16_1k:
				case MFMAInstr::f32_32x32x8bf16_1k:
				case MFMAInstr::f32_32x32x4_xf32:
				if (outType != VectorType::get(16, b.getF32Type()))
				return emitOpError(instrName + " must have vector<16xf32> outputs");
				break;
				case MFMAInstr::f32_4x4x1f32:
				case MFMAInstr::f32_16x16x4f32:
				case MFMAInstr::f32_4x4x4f16:
				case MFMAInstr::f32_16x16x16f16:
				case MFMAInstr::f32_4x4x2bf16:
				case MFMAInstr::f32_16x16x8bf16:
				case MFMAInstr::f32_4x4x4bf16_1k:
				case MFMAInstr::f32_16x16x16bf16_1k:
				case MFMAInstr::f32_16x16x8_xf32:
				if (outType != VectorType::get(4, b.getF32Type()))
				return emitOpError(instrName + " must have vector<4xf32> outputs");
				break;
				case MFMAInstr::i32_32x32x4i8:

				if (outType != VectorType::get(32, b.getI32Type()))
				return emitOpError(instrName + " must have vector<32xi32> outputs");
				break;
				case MFMAInstr::i32_16x16x4i8:
				case MFMAInstr::i32_32x32x8i8:
				case MFMAInstr::i32_32x32x16_i8:
				if (outType != VectorType::get(16, b.getI32Type()))
				return emitOpError(instrName + " must have vector<16xi32> outputs");
				break;
				case MFMAInstr::i32_4x4x4i8:
				case MFMAInstr::i32_16x16x16i8:
				case MFMAInstr::i32_16x16x32_i8:
				if (outType != VectorType::get(4, b.getI32Type()))
				return emitOpError(instrName + " must have vector<4xi32> outputs");
				break;
				case MFMAInstr::f64_16x16x4f64:
				if (outType != VectorType::get(4, b.getF64Type()))
				return emitOpError(instrName + " must have vector<4xf64> outputs");
				break;
				case MFMAInstr::f64_4x4x4f64:
				if (outType != b.getF64Type())
				return emitOpError(instrName + " must have f64 outputs");
				}
				return success();
	}			}

				#include "mlir/Dialect/AMDGPU/AMDGPUEnums.cpp.inc"

				#define GET_ATTRDEF_CLASSES
				#include "mlir/Dialect/AMDGPU/AMDGPUAttributes.cpp.inc"

	#define GET_OP_CLASSES			#define GET_OP_CLASSES
	#include "mlir/Dialect/AMDGPU/AMDGPU.cpp.inc"			#include "mlir/Dialect/AMDGPU/AMDGPU.cpp.inc"
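
Reviewer-style note on the verifier above: the first `switch` in `MFMAOp::verify()` amounts to a lookup from instruction to the input type it accepts, with the i8 variants also accepting a packed integer. The following is a minimal, MLIR-free sketch of that idea as a table, not the code in this revision. `InputRule`, `kInputRules`, and `inputTypeIsValid` are hypothetical names, types are modelled as plain strings, and only one representative instruction per group is listed.

```cpp
#include <optional>
#include <string_view>

// Hypothetical model of the input-type check in MFMAOp::verify().
// `alt` is the packed-integer spelling that the verifier also accepts for
// the i8 instructions; it is empty when there is no alternative.
struct InputRule {
  std::string_view instr;
  std::string_view type;
  std::string_view alt;
};

// One representative row per case group in the verifier's first switch.
constexpr InputRule kInputRules[] = {
    {"f32_32x32x1f32", "f32", ""},
    {"f32_32x32x4f16", "vector<4xf16>", ""},
    {"i32_32x32x4i8", "vector<4xi8>", "i32"},
    {"f32_32x32x2bf16", "vector<2xbf16>", ""},
    {"f32_32x32x4bf16_1k", "vector<4xbf16>", ""},
    {"f64_16x16x4f64", "f64", ""},
    {"i32_16x16x32_i8", "vector<8xi8>", "i64"},
    {"f32_16x16x8_xf32", "vector<2xf32>", ""},
};

// Returns whether `type` is accepted for `instr`, or std::nullopt when the
// instruction is not in the table.
constexpr std::optional<bool> inputTypeIsValid(std::string_view instr,
                                               std::string_view type) {
  for (const InputRule &rule : kInputRules) {
    if (rule.instr != instr)
      continue;
    return type == rule.type || (!rule.alt.empty() && type == rule.alt);
  }
  return std::nullopt;
}

// Compile-time spot check (requires C++17).
static_assert(*inputTypeIsValid("i32_32x32x4i8", "i32"));
```

A table like this keeps the accepted type next to the instruction name, at the cost of string comparisons that the real verifier avoids by comparing `Type` objects directly.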

mlir/lib/Dialect/AMDGPU/IR/CMakeLists.txt

	add_mlir_dialect_library(MLIRAMDGPU			add_mlir_dialect_library(MLIRAMDGPU
	AMDGPUDialect.cpp			AMDGPUDialect.cpp

	ADDITIONAL_HEADER_DIRS			ADDITIONAL_HEADER_DIRS
	${MLIR_MAIN_INCLUDE_DIR}/mlir/Dialect/AMDGPU			${MLIR_MAIN_INCLUDE_DIR}/mlir/Dialect/AMDGPU

	DEPENDS			DEPENDS
				MLIRAMDGPUEnumsGen
				MLIRAMDGPUAttributesIncGen
	MLIRAMDGPUIncGen			MLIRAMDGPUIncGen

	LINK_LIBS PUBLIC			LINK_LIBS PUBLIC
	MLIRIR			MLIRIR
	MLIRSideEffectInterfaces			MLIRSideEffectInterfaces
	)			)

mlir/test/Conversion/AMDGPUToROCDL/amdgpu-to-rocdl.mlir

func.func @gpu_gcn_raw_buffer_atomic_fadd_f32(%value: f32, %buf: memref<64xf32>, %idx: i32) {
// CHECK: %[[numRecords:.*]] = llvm.mlir.constant(256 : i32)		// CHECK: %[[numRecords:.*]] = llvm.mlir.constant(256 : i32)
// CHECK: llvm.insertelement{{.*}}%[[numRecords]]		// CHECK: llvm.insertelement{{.*}}%[[numRecords]]
// CHECK: %[[word3:.*]] = llvm.mlir.constant(159744 : i32)		// CHECK: %[[word3:.*]] = llvm.mlir.constant(159744 : i32)
// CHECK: %[[resource:.*]] = llvm.insertelement{{.*}}%[[word3]]		// CHECK: %[[resource:.*]] = llvm.insertelement{{.*}}%[[word3]]
// CHECK: rocdl.raw.buffer.atomic.fadd %{{.*}} %[[resource]], %{{.*}}, %{{.*}}, %{{.*}} : f32		// CHECK: rocdl.raw.buffer.atomic.fadd %{{.*}} %[[resource]], %{{.*}}, %{{.*}}, %{{.*}} : f32
amdgpu.raw_buffer_atomic_fadd {boundsCheck = true, targetIsRDNA = false} %value -> %buf[%idx] : f32 -> memref<64xf32>, i32		amdgpu.raw_buffer_atomic_fadd {boundsCheck = true, targetIsRDNA = false} %value -> %buf[%idx] : f32 -> memref<64xf32>, i32
func.return		func.return
}		}

		func.func @mfma_to_rocdl(%arg0 : f32, %arg1 : vector<32xf32>,
		%arg2 : vector<16xf32>, %arg3 : vector<4xf32>,
		%arg4 : vector<4xf16>, %arg5 : vector<4xi8>,
		%arg6 : vector<32xi32>, %arg7 : vector<16xi32>,
		%arg8 : vector<4xi32>, %arg9 : vector<2xbf16>,
		%arg10 : vector<4xbf16>, %arg11 : f64,
		%arg12 : vector<4xf64>, %arg13 : vector<8xi8>,
		%arg14 : vector<2xf32>) {
		// CHECK: rocdl.mfma.f32.32x32x1f32{{.*}}: (f32, f32, vector<32xf32>, i32, i32, i32) -> vector<32xf32>
		amdgpu.mfma f32_32x32x1f32 %arg0 * %arg0 + %arg1 cbsz = 0 abid = 0 blgp = 0 : f32, vector<32xf32>
		// CHECK: rocdl.mfma.f32.16x16x1f32{{.*}}: (f32, f32, vector<16xf32>, i32, i32, i32) -> vector<16xf32>
		amdgpu.mfma f32_16x16x1f32 %arg0 * %arg0 + %arg2 cbsz = 0 abid = 0 blgp = 0 : f32, vector<16xf32>
		// CHECK: rocdl.mfma.f32.4x4x1f32{{.*}}: (f32, f32, vector<4xf32>, i32, i32, i32) -> vector<4xf32>
		amdgpu.mfma f32_4x4x1f32 %arg0 * %arg0 + %arg3 cbsz = 0 abid = 0 blgp = 0 : f32, vector<4xf32>
		// CHECK: rocdl.mfma.f32.32x32x2f32{{.*}}: (f32, f32, vector<16xf32>, i32, i32, i32) -> vector<16xf32>
		amdgpu.mfma f32_32x32x2f32 %arg0 * %arg0 + %arg2 cbsz = 0 abid = 0 blgp = 0 : f32, vector<16xf32>
		// CHECK: rocdl.mfma.f32.16x16x4f32{{.*}}: (f32, f32, vector<4xf32>, i32, i32, i32) -> vector<4xf32>
		amdgpu.mfma f32_16x16x4f32 %arg0 * %arg0 + %arg3 cbsz = 0 abid = 0 blgp = 0 : f32, vector<4xf32>
		// CHECK: rocdl.mfma.f32.32x32x4f16{{.*}}: (vector<4xf16>, vector<4xf16>, vector<32xf32>, i32, i32, i32) -> vector<32xf32>
		amdgpu.mfma f32_32x32x4f16 %arg4 * %arg4 + %arg1 cbsz = 0 abid = 0 blgp = 0 : vector<4xf16>, vector<32xf32>
		// CHECK: rocdl.mfma.f32.16x16x4f16{{.*}}: (vector<4xf16>, vector<4xf16>, vector<16xf32>, i32, i32, i32) -> vector<16xf32>
		amdgpu.mfma f32_16x16x4f16 %arg4 * %arg4 + %arg2 cbsz = 0 abid = 0 blgp = 0 : vector<4xf16>, vector<16xf32>
		// CHECK: rocdl.mfma.f32.4x4x4f16{{.*}}: (vector<4xf16>, vector<4xf16>, vector<4xf32>, i32, i32, i32) -> vector<4xf32>
		amdgpu.mfma f32_4x4x4f16 %arg4 * %arg4 + %arg3 cbsz = 0 abid = 0 blgp = 0 : vector<4xf16>, vector<4xf32>
		// CHECK: rocdl.mfma.f32.32x32x8f16{{.*}}: (vector<4xf16>, vector<4xf16>, vector<16xf32>, i32, i32, i32) -> vector<16xf32>
		amdgpu.mfma f32_32x32x8f16 %arg4 * %arg4 + %arg2 cbsz = 0 abid = 0 blgp = 0 : vector<4xf16>, vector<16xf32>
		// CHECK: rocdl.mfma.f32.16x16x16f16{{.*}}: (vector<4xf16>, vector<4xf16>, vector<4xf32>, i32, i32, i32) -> vector<4xf32>
		amdgpu.mfma f32_16x16x16f16 %arg4 * %arg4 + %arg3 cbsz = 0 abid = 0 blgp = 0 : vector<4xf16>, vector<4xf32>
		// CHECK: rocdl.mfma.i32.32x32x4i8{{.*}}: (i32, i32, vector<32xi32>, i32, i32, i32) -> vector<32xi32>
		amdgpu.mfma i32_32x32x4i8 %arg5 * %arg5 + %arg6 cbsz = 0 abid = 0 blgp = 0 : vector<4xi8>, vector<32xi32>
		// CHECK: rocdl.mfma.i32.16x16x4i8{{.*}}: (i32, i32, vector<16xi32>, i32, i32, i32) -> vector<16xi32>
		amdgpu.mfma i32_16x16x4i8 %arg5 * %arg5 + %arg7 cbsz = 0 abid = 0 blgp = 0 : vector<4xi8>, vector<16xi32>
		// CHECK: rocdl.mfma.i32.4x4x4i8{{.*}}: (i32, i32, vector<4xi32>, i32, i32, i32) -> vector<4xi32>
		amdgpu.mfma i32_4x4x4i8 %arg5 * %arg5 + %arg8 cbsz = 0 abid = 0 blgp = 0 : vector<4xi8>, vector<4xi32>
		// CHECK: rocdl.mfma.i32.32x32x8i8{{.*}}: (i32, i32, vector<16xi32>, i32, i32, i32) -> vector<16xi32>
		amdgpu.mfma i32_32x32x8i8 %arg5 * %arg5 + %arg7 cbsz = 0 abid = 0 blgp = 0 : vector<4xi8>, vector<16xi32>
		// CHECK: rocdl.mfma.i32.16x16x16i8{{.*}}: (i32, i32, vector<4xi32>, i32, i32, i32) -> vector<4xi32>
		amdgpu.mfma i32_16x16x16i8 %arg5 * %arg5 + %arg8 cbsz = 0 abid = 0 blgp = 0 : vector<4xi8>, vector<4xi32>
		// CHECK: rocdl.mfma.f32.32x32x2bf16{{.*}}: (vector<2xbf16>, vector<2xbf16>, vector<32xf32>, i32, i32, i32) -> vector<32xf32>
		amdgpu.mfma f32_32x32x2bf16 %arg9 * %arg9 + %arg1 cbsz = 0 abid = 0 blgp = 0 : vector<2xbf16>, vector<32xf32>
		// CHECK: rocdl.mfma.f32.16x16x2bf16{{.*}}: (vector<2xbf16>, vector<2xbf16>, vector<16xf32>, i32, i32, i32) -> vector<16xf32>
		amdgpu.mfma f32_16x16x2bf16 %arg9 * %arg9 + %arg2 cbsz = 0 abid = 0 blgp = 0 : vector<2xbf16>, vector<16xf32>
		// CHECK: rocdl.mfma.f32.4x4x2bf16{{.*}}: (vector<2xbf16>, vector<2xbf16>, vector<4xf32>, i32, i32, i32) -> vector<4xf32>
		amdgpu.mfma f32_4x4x2bf16 %arg9 * %arg9 + %arg3 cbsz = 0 abid = 0 blgp = 0 : vector<2xbf16>, vector<4xf32>
		// CHECK: rocdl.mfma.f32.32x32x4bf16{{.*}}: (vector<2xbf16>, vector<2xbf16>, vector<16xf32>, i32, i32, i32) -> vector<16xf32>
		amdgpu.mfma f32_32x32x4bf16 %arg9 * %arg9 + %arg2 cbsz = 0 abid = 0 blgp = 0 : vector<2xbf16>, vector<16xf32>
		// CHECK: rocdl.mfma.f32.16x16x8bf16{{.*}}: (vector<2xbf16>, vector<2xbf16>, vector<4xf32>, i32, i32, i32) -> vector<4xf32>
		amdgpu.mfma f32_16x16x8bf16 %arg9 * %arg9 + %arg3 cbsz = 0 abid = 0 blgp = 0 : vector<2xbf16>, vector<4xf32>
		// CHECK: rocdl.mfma.f32.32x32x4bf16.1k{{.*}}: (vector<4xbf16>, vector<4xbf16>, vector<32xf32>, i32, i32, i32) -> vector<32xf32>
		amdgpu.mfma f32_32x32x4bf16_1k %arg10 * %arg10 + %arg1 cbsz = 0 abid = 0 blgp = 0 : vector<4xbf16>, vector<32xf32>
		// CHECK: rocdl.mfma.f32.16x16x4bf16.1k{{.*}}: (vector<4xbf16>, vector<4xbf16>, vector<16xf32>, i32, i32, i32) -> vector<16xf32>
		amdgpu.mfma f32_16x16x4bf16_1k %arg10 * %arg10 + %arg2 cbsz = 0 abid = 0 blgp = 0 : vector<4xbf16>, vector<16xf32>
		// CHECK: rocdl.mfma.f32.4x4x4bf16.1k{{.*}}: (vector<4xbf16>, vector<4xbf16>, vector<4xf32>, i32, i32, i32) -> vector<4xf32>
		amdgpu.mfma f32_4x4x4bf16_1k %arg10 * %arg10 + %arg3 cbsz = 0 abid = 0 blgp = 0 : vector<4xbf16>, vector<4xf32>
		// CHECK: rocdl.mfma.f32.32x32x8bf16.1k{{.*}}: (vector<4xbf16>, vector<4xbf16>, vector<16xf32>, i32, i32, i32) -> vector<16xf32>
		amdgpu.mfma f32_32x32x8bf16_1k %arg10 * %arg10 + %arg2 cbsz = 0 abid = 0 blgp = 0 : vector<4xbf16>, vector<16xf32>
		// CHECK: rocdl.mfma.f32.16x16x16bf16.1k{{.*}}: (vector<4xbf16>, vector<4xbf16>, vector<4xf32>, i32, i32, i32) -> vector<4xf32>
		amdgpu.mfma f32_16x16x16bf16_1k %arg10 * %arg10 + %arg3 cbsz = 0 abid = 0 blgp = 0 : vector<4xbf16>, vector<4xf32>
		// CHECK: rocdl.mfma.f64.16x16x4f64{{.*}}: (f64, f64, vector<4xf64>, i32, i32, i32) -> vector<4xf64>
		amdgpu.mfma f64_16x16x4f64 %arg11 * %arg11 + %arg12 cbsz = 0 abid = 0 blgp = 0 : f64, vector<4xf64>
		// CHECK: rocdl.mfma.f64.4x4x4f64{{.*}}: (f64, f64, f64, i32, i32, i32) -> f64
		amdgpu.mfma f64_4x4x4f64 %arg11 * %arg11 + %arg11 cbsz = 0 abid = 0 blgp = 0 : f64, f64
		// CHECK: rocdl.mfma.i32.16x16x32.i8{{.*}}: (i64, i64, vector<4xi32>, i32, i32, i32) -> vector<4xi32>
		amdgpu.mfma i32_16x16x32_i8 %arg13 * %arg13 + %arg8 cbsz = 0 abid = 0 blgp = 0 : vector<8xi8>, vector<4xi32>
		// CHECK: rocdl.mfma.i32.32x32x16.i8{{.*}}: (i64, i64, vector<16xi32>, i32, i32, i32) -> vector<16xi32>
		amdgpu.mfma i32_32x32x16_i8 %arg13 * %arg13 + %arg7 cbsz = 0 abid = 0 blgp = 0 : vector<8xi8>, vector<16xi32>
		// CHECK: rocdl.mfma.f32.16x16x8.xf32{{.*}}: (vector<2xf32>, vector<2xf32>, vector<4xf32>, i32, i32, i32) -> vector<4xf32>
		amdgpu.mfma f32_16x16x8_xf32 %arg14 * %arg14 + %arg3 cbsz = 0 abid = 0 blgp = 0 : vector<2xf32>, vector<4xf32>
		// CHECK: rocdl.mfma.f32.32x32x4.xf32{{.*}}: (vector<2xf32>, vector<2xf32>, vector<16xf32>, i32, i32, i32) -> vector<16xf32>
		amdgpu.mfma f32_32x32x4_xf32 %arg14 * %arg14 + %arg2 cbsz = 0 abid = 0 blgp = 0 : vector<2xf32>, vector<16xf32>
		func.return
		}
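
Note on the i8 cases in the test above: `amdgpu.mfma` takes `vector<4xi8>` operands, while the `rocdl.mfma.i32.*` CHECK lines expect `i32`. The sketch below shows what that repacking means at the bit level, assuming the lowering is a plain bitcast on a little-endian target (the lowering pattern itself is not shown in this part of the diff). `packBytes` is a hypothetical helper, not part of the patch.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: combine four signed bytes into one 32-bit word so
// that element 0 lands in the least significant byte, which is how a
// vector<4xi8> -> i32 bitcast behaves on a little-endian target.
inline std::uint32_t packBytes(const std::array<std::int8_t, 4> &bytes) {
  std::uint32_t word = 0;
  for (std::size_t i = 0; i < bytes.size(); ++i) {
    // Go through uint8_t first so negative values are not sign-extended.
    const auto byte = static_cast<std::uint8_t>(bytes[i]);
    word |= static_cast<std::uint32_t>(byte) << (8 * i);
  }
  return word;
}
```

For example, `packBytes({1, 2, 3, 4})` yields `0x04030201`. Shifts are used instead of `memcpy` so the result does not depend on the byte order of the host running the sketch.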

mlir/test/Dialect/AMDGPU/ops.mlir

	}			}

	// CHECK-LABEL: func @raw_buffer_atomic_fadd_f32_to_rank_4			// CHECK-LABEL: func @raw_buffer_atomic_fadd_f32_to_rank_4
	func.func @raw_buffer_atomic_fadd_f32_to_rank_4(%value : f32, %dst : memref<128x64x32x16xf32>, %offset : i32, %idx0 : i32, %idx1 : i32, %idx2 : i32, %idx3 : i32) {			func.func @raw_buffer_atomic_fadd_f32_to_rank_4(%value : f32, %dst : memref<128x64x32x16xf32>, %offset : i32, %idx0 : i32, %idx1 : i32, %idx2 : i32, %idx3 : i32) {
	// CHECK: amdgpu.raw_buffer_atomic_fadd {boundsCheck = true, indexOffset = 1 : i32, targetIsRDNA = false} %{{.*}} -> %{{.*}}[%{{.*}}, %{{.*}}, %{{.*}}] sgprOffset %{{.*}} : f32 -> memref<128x64x32x16xf32>, i32, i32, i32, i32			// CHECK: amdgpu.raw_buffer_atomic_fadd {boundsCheck = true, indexOffset = 1 : i32, targetIsRDNA = false} %{{.*}} -> %{{.*}}[%{{.*}}, %{{.*}}, %{{.*}}] sgprOffset %{{.*}} : f32 -> memref<128x64x32x16xf32>, i32, i32, i32, i32
	amdgpu.raw_buffer_atomic_fadd {boundsCheck = true, indexOffset = 1 : i32, targetIsRDNA = false} %value -> %dst[%idx0, %idx1, %idx2, %idx3] sgprOffset %offset : f32 -> memref<128x64x32x16xf32>, i32, i32, i32, i32			amdgpu.raw_buffer_atomic_fadd {boundsCheck = true, indexOffset = 1 : i32, targetIsRDNA = false} %value -> %dst[%idx0, %idx1, %idx2, %idx3] sgprOffset %offset : f32 -> memref<128x64x32x16xf32>, i32, i32, i32, i32
	func.return			func.return
	}			}

				// CHECK-LABEL: func @mfma
				func.func @mfma(%arg0 : f32, %arg1 : vector<32xf32>, %arg2 : vector<16xf32>,
				%arg3 : vector<4xf32>, %arg4 : vector<4xf16>,
				%arg5 : vector<4xi8>, %arg6 : vector<32xi32>,
				%arg7 : vector<16xi32>, %arg8 : vector<4xi32>,
				%arg9 : vector<2xbf16>, %arg10 : vector<4xbf16>, %arg11 : f64,
				%arg12 : vector<4xf64>, %arg13 : vector<8xi8>,
				%arg14 : vector<2xf32>) {
				// CHECK: amdgpu.mfma f32_32x32x1f32 %{{.*}} * %{{.*}} + %{{.*}} cbsz = 0 abid = 0 blgp = 0 : f32, vector<32xf32>
				amdgpu.mfma f32_32x32x1f32 %arg0 * %arg0 + %arg1 cbsz = 0 abid = 0 blgp = 0 : f32, vector<32xf32>
				// CHECK: amdgpu.mfma f32_16x16x1f32 %{{.*}} * %{{.*}} + %{{.*}} cbsz = 0 abid = 0 blgp = 0 : f32, vector<16xf32>
				amdgpu.mfma f32_16x16x1f32 %arg0 * %arg0 + %arg2 cbsz = 0 abid = 0 blgp = 0 : f32, vector<16xf32>
				// CHECK: amdgpu.mfma f32_4x4x1f32 %{{.*}} * %{{.*}} + %{{.*}} cbsz = 0 abid = 0 blgp = 0 : f32, vector<4xf32>
				amdgpu.mfma f32_4x4x1f32 %arg0 * %arg0 + %arg3 cbsz = 0 abid = 0 blgp = 0 : f32, vector<4xf32>
				// CHECK: amdgpu.mfma f32_32x32x2f32 %{{.*}} * %{{.*}} + %{{.*}} cbsz = 0 abid = 0 blgp = 0 : f32, vector<16xf32>
				amdgpu.mfma f32_32x32x2f32 %arg0 * %arg0 + %arg2 cbsz = 0 abid = 0 blgp = 0 : f32, vector<16xf32>
				// CHECK: amdgpu.mfma f32_16x16x4f32 %{{.*}} * %{{.*}} + %{{.*}} cbsz = 0 abid = 0 blgp = 0 : f32, vector<4xf32>
				amdgpu.mfma f32_16x16x4f32 %arg0 * %arg0 + %arg3 cbsz = 0 abid = 0 blgp = 0 : f32, vector<4xf32>
				// CHECK: amdgpu.mfma f32_32x32x4f16 %{{.*}} * %{{.*}} + %{{.*}} cbsz = 0 abid = 0 blgp = 0 : vector<4xf16>, vector<32xf32>
				amdgpu.mfma f32_32x32x4f16 %arg4 * %arg4 + %arg1 cbsz = 0 abid = 0 blgp = 0 : vector<4xf16>, vector<32xf32>
				// CHECK: amdgpu.mfma f32_16x16x4f16 %{{.*}} * %{{.*}} + %{{.*}} cbsz = 0 abid = 0 blgp = 0 : vector<4xf16>, vector<16xf32>
				amdgpu.mfma f32_16x16x4f16 %arg4 * %arg4 + %arg2 cbsz = 0 abid = 0 blgp = 0 : vector<4xf16>, vector<16xf32>
				// CHECK: amdgpu.mfma f32_4x4x4f16 %{{.*}} * %{{.*}} + %{{.*}} cbsz = 0 abid = 0 blgp = 0 : vector<4xf16>, vector<4xf32>
				amdgpu.mfma f32_4x4x4f16 %arg4 * %arg4 + %arg3 cbsz = 0 abid = 0 blgp = 0 : vector<4xf16>, vector<4xf32>
				// CHECK: amdgpu.mfma f32_32x32x8f16 %{{.*}} * %{{.*}} + %{{.*}} cbsz = 0 abid = 0 blgp = 0 : vector<4xf16>, vector<16xf32>
				amdgpu.mfma f32_32x32x8f16 %arg4 * %arg4 + %arg2 cbsz = 0 abid = 0 blgp = 0 : vector<4xf16>, vector<16xf32>
				// CHECK: amdgpu.mfma f32_16x16x16f16 %{{.*}} * %{{.*}} + %{{.*}} cbsz = 0 abid = 0 blgp = 0 : vector<4xf16>, vector<4xf32>
				amdgpu.mfma f32_16x16x16f16 %arg4 * %arg4 + %arg3 cbsz = 0 abid = 0 blgp = 0 : vector<4xf16>, vector<4xf32>
				// CHECK: amdgpu.mfma i32_32x32x4i8 %{{.*}} * %{{.*}} + %{{.*}} cbsz = 0 abid = 0 blgp = 0 : vector<4xi8>, vector<32xi32>
				amdgpu.mfma i32_32x32x4i8 %arg5 * %arg5 + %arg6 cbsz = 0 abid = 0 blgp = 0 : vector<4xi8>, vector<32xi32>
				// CHECK: amdgpu.mfma i32_16x16x4i8 %{{.*}} * %{{.*}} + %{{.*}} cbsz = 0 abid = 0 blgp = 0 : vector<4xi8>, vector<16xi32>
				amdgpu.mfma i32_16x16x4i8 %arg5 * %arg5 + %arg7 cbsz = 0 abid = 0 blgp = 0 : vector<4xi8>, vector<16xi32>
				// CHECK: amdgpu.mfma i32_4x4x4i8 %{{.*}} * %{{.*}} + %{{.*}} cbsz = 0 abid = 0 blgp = 0 : vector<4xi8>, vector<4xi32>
				amdgpu.mfma i32_4x4x4i8 %arg5 * %arg5 + %arg8 cbsz = 0 abid = 0 blgp = 0 : vector<4xi8>, vector<4xi32>
				// CHECK: amdgpu.mfma i32_32x32x8i8 %{{.*}} * %{{.*}} + %{{.*}} cbsz = 0 abid = 0 blgp = 0 : vector<4xi8>, vector<16xi32>
				amdgpu.mfma i32_32x32x8i8 %arg5 * %arg5 + %arg7 cbsz = 0 abid = 0 blgp = 0 : vector<4xi8>, vector<16xi32>
				// CHECK: amdgpu.mfma i32_16x16x16i8 %{{.*}} * %{{.*}} + %{{.*}} cbsz = 0 abid = 0 blgp = 0 : vector<4xi8>, vector<4xi32>
				amdgpu.mfma i32_16x16x16i8 %arg5 * %arg5 + %arg8 cbsz = 0 abid = 0 blgp = 0 : vector<4xi8>, vector<4xi32>
				// CHECK: amdgpu.mfma f32_32x32x2bf16 %{{.*}} * %{{.*}} + %{{.*}} cbsz = 0 abid = 0 blgp = 0 : vector<2xbf16>, vector<32xf32>
				amdgpu.mfma f32_32x32x2bf16 %arg9 * %arg9 + %arg1 cbsz = 0 abid = 0 blgp = 0 : vector<2xbf16>, vector<32xf32>
				// CHECK: amdgpu.mfma f32_16x16x2bf16 %{{.*}} * %{{.*}} + %{{.*}} cbsz = 0 abid = 0 blgp = 0 : vector<2xbf16>, vector<16xf32>
				amdgpu.mfma f32_16x16x2bf16 %arg9 * %arg9 + %arg2 cbsz = 0 abid = 0 blgp = 0 : vector<2xbf16>, vector<16xf32>
				// CHECK: amdgpu.mfma f32_4x4x2bf16 %{{.*}} * %{{.*}} + %{{.*}} cbsz = 0 abid = 0 blgp = 0 : vector<2xbf16>, vector<4xf32>
				amdgpu.mfma f32_4x4x2bf16 %arg9 * %arg9 + %arg3 cbsz = 0 abid = 0 blgp = 0 : vector<2xbf16>, vector<4xf32>
				// CHECK: amdgpu.mfma f32_32x32x4bf16 %{{.*}} * %{{.*}} + %{{.*}} cbsz = 0 abid = 0 blgp = 0 : vector<2xbf16>, vector<16xf32>
				amdgpu.mfma f32_32x32x4bf16 %arg9 * %arg9 + %arg2 cbsz = 0 abid = 0 blgp = 0 : vector<2xbf16>, vector<16xf32>
				// CHECK: amdgpu.mfma f32_16x16x8bf16 %{{.*}} * %{{.*}} + %{{.*}} cbsz = 0 abid = 0 blgp = 0 : vector<2xbf16>, vector<4xf32>
				amdgpu.mfma f32_16x16x8bf16 %arg9 * %arg9 + %arg3 cbsz = 0 abid = 0 blgp = 0 : vector<2xbf16>, vector<4xf32>
				// CHECK: amdgpu.mfma f32_32x32x4bf16_1k %{{.*}} * %{{.*}} + %{{.*}} cbsz = 0 abid = 0 blgp = 0 : vector<4xbf16>, vector<32xf32>
				amdgpu.mfma f32_32x32x4bf16_1k %arg10 * %arg10 + %arg1 cbsz = 0 abid = 0 blgp = 0 : vector<4xbf16>, vector<32xf32>
				// CHECK: amdgpu.mfma f32_16x16x4bf16_1k %{{.*}} * %{{.*}} + %{{.*}} cbsz = 0 abid = 0 blgp = 0 : vector<4xbf16>, vector<16xf32>
				amdgpu.mfma f32_16x16x4bf16_1k %arg10 * %arg10 + %arg2 cbsz = 0 abid = 0 blgp = 0 : vector<4xbf16>, vector<16xf32>
				// CHECK: amdgpu.mfma f32_4x4x4bf16_1k %{{.*}} * %{{.*}} + %{{.*}} cbsz = 0 abid = 0 blgp = 0 : vector<4xbf16>, vector<4xf32>
				amdgpu.mfma f32_4x4x4bf16_1k %arg10 * %arg10 + %arg3 cbsz = 0 abid = 0 blgp = 0 : vector<4xbf16>, vector<4xf32>
				// CHECK: amdgpu.mfma f32_32x32x8bf16_1k %{{.*}} * %{{.*}} + %{{.*}} cbsz = 0 abid = 0 blgp = 0 : vector<4xbf16>, vector<16xf32>
				amdgpu.mfma f32_32x32x8bf16_1k %arg10 * %arg10 + %arg2 cbsz = 0 abid = 0 blgp = 0 : vector<4xbf16>, vector<16xf32>
				// CHECK: amdgpu.mfma f32_16x16x16bf16_1k %{{.*}} * %{{.*}} + %{{.*}} cbsz = 0 abid = 0 blgp = 0 : vector<4xbf16>, vector<4xf32>
				amdgpu.mfma f32_16x16x16bf16_1k %arg10 * %arg10 + %arg3 cbsz = 0 abid = 0 blgp = 0 : vector<4xbf16>, vector<4xf32>
				// CHECK: amdgpu.mfma f64_16x16x4f64 %{{.*}} * %{{.*}} + %{{.*}} cbsz = 0 abid = 0 blgp = 0 : f64, vector<4xf64>
				amdgpu.mfma f64_16x16x4f64 %arg11 * %arg11 + %arg12 cbsz = 0 abid = 0 blgp = 0 : f64, vector<4xf64>
				// CHECK: amdgpu.mfma f64_4x4x4f64 %{{.*}} * %{{.*}} + %{{.*}} cbsz = 0 abid = 0 blgp = 0 : f64, f64
				amdgpu.mfma f64_4x4x4f64 %arg11 * %arg11 + %arg11 cbsz = 0 abid = 0 blgp = 0 : f64, f64
				// CHECK: amdgpu.mfma i32_16x16x32_i8 %{{.*}} * %{{.*}} + %{{.*}} cbsz = 0 abid = 0 blgp = 0 : vector<8xi8>, vector<4xi32>
				amdgpu.mfma i32_16x16x32_i8 %arg13 * %arg13 + %arg8 cbsz = 0 abid = 0 blgp = 0 : vector<8xi8>, vector<4xi32>
				// CHECK: amdgpu.mfma i32_32x32x16_i8 %{{.*}} * %{{.*}} + %{{.*}} cbsz = 0 abid = 0 blgp = 0 : vector<8xi8>, vector<16xi32>
				amdgpu.mfma i32_32x32x16_i8 %arg13 * %arg13 + %arg7 cbsz = 0 abid = 0 blgp = 0 : vector<8xi8>, vector<16xi32>
				// CHECK: amdgpu.mfma f32_16x16x8_xf32 %{{.*}} * %{{.*}} + %{{.*}} cbsz = 0 abid = 0 blgp = 0 : vector<2xf32>, vector<4xf32>
				amdgpu.mfma f32_16x16x8_xf32 %arg14 * %arg14 + %arg3 cbsz = 0 abid = 0 blgp = 0 : vector<2xf32>, vector<4xf32>
				// CHECK: amdgpu.mfma f32_32x32x4_xf32 %{{.*}} * %{{.*}} + %{{.*}} cbsz = 0 abid = 0 blgp = 0 : vector<2xf32>, vector<16xf32>
				amdgpu.mfma f32_32x32x4_xf32 %arg14 * %arg14 + %arg2 cbsz = 0 abid = 0 blgp = 0 : vector<2xf32>, vector<16xf32>
				func.return
				}

mlir/test/Dialect/LLVMIR/rocdl.mlir

func.func @rocdl.barrier() {
llvm.return		llvm.return
}		}

func.func @rocdl.xdlops(%arg0 : f32, %arg1 : f32,		func.func @rocdl.xdlops(%arg0 : f32, %arg1 : f32,
%arg2 : vector<32xf32>, %arg3 : i32,		%arg2 : vector<32xf32>, %arg3 : i32,
%arg4 : vector<16xf32>, %arg5 : vector<4xf32>,		%arg4 : vector<16xf32>, %arg5 : vector<4xf32>,
%arg6 : vector<4xf16>, %arg7 : vector<32xi32>,		%arg6 : vector<4xf16>, %arg7 : vector<32xi32>,
%arg8 : vector<16xi32>, %arg9 : vector<4xi32>,		%arg8 : vector<16xi32>, %arg9 : vector<4xi32>,
%arg10 : vector<2xi16>) -> vector<32xf32> {		%arg10 : vector<2xi16>, %arg11 : vector<4xi16>,
		%arg12 : vector<4xf64>, %arg13 : f64,
		%arg14 : i64, %arg15 : vector<2xf32>) {
// CHECK-LABEL: rocdl.xdlops		// CHECK-LABEL: rocdl.xdlops
// CHECK: rocdl.mfma.f32.32x32x1f32 {{.*}} : (f32, f32, vector<32xf32>, i32, i32, i32) -> vector<32xf32>		// CHECK: rocdl.mfma.f32.32x32x1f32 {{.*}} : (f32, f32, vector<32xf32>, i32, i32, i32) -> vector<32xf32>
%r0 = rocdl.mfma.f32.32x32x1f32 %arg0, %arg1, %arg2, %arg3, %arg3, %arg3 :		%r0 = rocdl.mfma.f32.32x32x1f32 %arg0, %arg1, %arg2, %arg3, %arg3, %arg3 :
(f32, f32, vector<32xf32>,		(f32, f32, vector<32xf32>,
i32, i32, i32) -> vector<32xf32>		i32, i32, i32) -> vector<32xf32>

// CHECK: rocdl.mfma.f32.16x16x1f32 {{.*}} : (f32, f32, vector<16xf32>, i32, i32, i32) -> vector<16xf32>		// CHECK: rocdl.mfma.f32.16x16x1f32 {{.*}} : (f32, f32, vector<16xf32>, i32, i32, i32) -> vector<16xf32>
%r1 = rocdl.mfma.f32.16x16x1f32 %arg0, %arg1, %arg4, %arg3, %arg3, %arg3 :		%r1 = rocdl.mfma.f32.16x16x1f32 %arg0, %arg1, %arg4, %arg3, %arg3, %arg3 :
(f32, f32, vector<16xf32>,		(f32, f32, vector<16xf32>,
i32, i32, i32) -> vector<16xf32>		i32, i32, i32) -> vector<16xf32>

// CHECK: rocdl.mfma.f32.16x16x4f32 {{.*}} : (f32, f32, vector<4xf32>, i32, i32, i32) -> vector<4xf32>
%r2 = rocdl.mfma.f32.16x16x4f32 %arg0, %arg1, %arg5, %arg3, %arg3, %arg3 :
(f32, f32, vector<4xf32>,
i32, i32, i32) -> vector<4xf32>

// CHECK: rocdl.mfma.f32.4x4x1f32 {{.*}} : (f32, f32, vector<4xf32>, i32, i32, i32) -> vector<4xf32>		// CHECK: rocdl.mfma.f32.4x4x1f32 {{.*}} : (f32, f32, vector<4xf32>, i32, i32, i32) -> vector<4xf32>
%r3 = rocdl.mfma.f32.4x4x1f32 %arg0, %arg1, %arg5, %arg3, %arg3, %arg3 :		%r2 = rocdl.mfma.f32.4x4x1f32 %arg0, %arg1, %arg5, %arg3, %arg3, %arg3 :
(f32, f32, vector<4xf32>,		(f32, f32, vector<4xf32>,
i32, i32, i32) -> vector<4xf32>		i32, i32, i32) -> vector<4xf32>

// CHECK: rocdl.mfma.f32.32x32x2f32 {{.*}} : (f32, f32, vector<16xf32>, i32, i32, i32) -> vector<16xf32>		// CHECK: rocdl.mfma.f32.32x32x2f32 {{.*}} : (f32, f32, vector<16xf32>, i32, i32, i32) -> vector<16xf32>
%r4= rocdl.mfma.f32.32x32x2f32 %arg0, %arg1, %arg4, %arg3, %arg3, %arg3 :		%r3= rocdl.mfma.f32.32x32x2f32 %arg0, %arg1, %arg4, %arg3, %arg3, %arg3 :
(f32, f32, vector<16xf32>,		(f32, f32, vector<16xf32>,
i32, i32, i32) -> vector<16xf32>		i32, i32, i32) -> vector<16xf32>

		// CHECK: rocdl.mfma.f32.16x16x4f32 {{.*}} : (f32, f32, vector<4xf32>, i32, i32, i32) -> vector<4xf32>
		%r4 = rocdl.mfma.f32.16x16x4f32 %arg0, %arg1, %arg5, %arg3, %arg3, %arg3 :
		(f32, f32, vector<4xf32>,
		i32, i32, i32) -> vector<4xf32>

// CHECK: rocdl.mfma.f32.32x32x4f16 {{.*}} : (vector<4xf16>, vector<4xf16>, vector<32xf32>, i32, i32, i32) -> vector<32xf32>		// CHECK: rocdl.mfma.f32.32x32x4f16 {{.*}} : (vector<4xf16>, vector<4xf16>, vector<32xf32>, i32, i32, i32) -> vector<32xf32>
%r5 = rocdl.mfma.f32.32x32x4f16 %arg6, %arg6, %arg2, %arg3, %arg3, %arg3 :		%r5 = rocdl.mfma.f32.32x32x4f16 %arg6, %arg6, %arg2, %arg3, %arg3, %arg3 :
(vector<4xf16>, vector<4xf16>, vector<32xf32>,		(vector<4xf16>, vector<4xf16>, vector<32xf32>,
i32, i32, i32) -> vector<32xf32>		i32, i32, i32) -> vector<32xf32>

// CHECK: rocdl.mfma.f32.16x16x4f16 {{.*}} : (vector<4xf16>, vector<4xf16>, vector<16xf32>, i32, i32, i32) -> vector<16xf32>		// CHECK: rocdl.mfma.f32.16x16x4f16 {{.*}} : (vector<4xf16>, vector<4xf16>, vector<16xf32>, i32, i32, i32) -> vector<16xf32>
%r6 = rocdl.mfma.f32.16x16x4f16 %arg6, %arg6, %arg4, %arg3, %arg3, %arg3 :		%r6 = rocdl.mfma.f32.16x16x4f16 %arg6, %arg6, %arg4, %arg3, %arg3, %arg3 :
(vector<4xf16>, vector<4xf16>, vector<16xf32>,		(vector<4xf16>, vector<4xf16>, vector<16xf32>,
%r18 = rocdl.mfma.f32.32x32x4bf16 %arg10, %arg10, %arg4, %arg3, %arg3, %arg3 :
(vector<2xi16>, vector<2xi16>, vector<16xf32>,		(vector<2xi16>, vector<2xi16>, vector<16xf32>,
i32, i32, i32) -> vector<16xf32>		i32, i32, i32) -> vector<16xf32>

// CHECK: rocdl.mfma.f32.16x16x8bf16 {{.*}} : (vector<2xi16>, vector<2xi16>, vector<4xf32>, i32, i32, i32) -> vector<4xf32>		// CHECK: rocdl.mfma.f32.16x16x8bf16 {{.*}} : (vector<2xi16>, vector<2xi16>, vector<4xf32>, i32, i32, i32) -> vector<4xf32>
%r19 = rocdl.mfma.f32.16x16x8bf16 %arg10, %arg10, %arg5, %arg3, %arg3, %arg3 :		%r19 = rocdl.mfma.f32.16x16x8bf16 %arg10, %arg10, %arg5, %arg3, %arg3, %arg3 :
(vector<2xi16>, vector<2xi16>, vector<4xf32>,		(vector<2xi16>, vector<2xi16>, vector<4xf32>,
i32, i32, i32) -> vector<4xf32>		i32, i32, i32) -> vector<4xf32>

llvm.return %r0 : vector<32xf32>
		// CHECK: rocdl.mfma.f32.32x32x4bf16.1k {{.*}} : (vector<4xi16>, vector<4xi16>, vector<32xf32>, i32, i32, i32) -> vector<32xf32>
		%r20 = rocdl.mfma.f32.32x32x4bf16.1k %arg11, %arg11, %arg2, %arg3, %arg3, %arg3 :
		(vector<4xi16>, vector<4xi16>, vector<32xf32>,
		i32, i32, i32) -> vector<32xf32>

		// CHECK: rocdl.mfma.f32.16x16x4bf16.1k {{.*}} : (vector<4xi16>, vector<4xi16>, vector<16xf32>, i32, i32, i32) -> vector<16xf32>
		%r21 = rocdl.mfma.f32.16x16x4bf16.1k %arg11, %arg11, %arg4, %arg3, %arg3, %arg3 :
		(vector<4xi16>, vector<4xi16>, vector<16xf32>,
		i32, i32, i32) -> vector<16xf32>

		// CHECK: rocdl.mfma.f32.4x4x4bf16.1k {{.*}} : (vector<4xi16>, vector<4xi16>, vector<4xf32>, i32, i32, i32) -> vector<4xf32>
		%r22 = rocdl.mfma.f32.4x4x4bf16.1k %arg11, %arg11, %arg5, %arg3, %arg3, %arg3 :
		(vector<4xi16>, vector<4xi16>, vector<4xf32>,
		i32, i32, i32) -> vector<4xf32>

		// CHECK: rocdl.mfma.f32.32x32x8bf16.1k {{.*}} : (vector<4xi16>, vector<4xi16>, vector<16xf32>, i32, i32, i32) -> vector<16xf32>
		%r23 = rocdl.mfma.f32.32x32x8bf16.1k %arg11, %arg11, %arg4, %arg3, %arg3, %arg3 :
		(vector<4xi16>, vector<4xi16>, vector<16xf32>,
		i32, i32, i32) -> vector<16xf32>

		// CHECK: rocdl.mfma.f32.16x16x16bf16.1k {{.*}} : (vector<4xi16>, vector<4xi16>, vector<4xf32>, i32, i32, i32) -> vector<4xf32>
		%r24 = rocdl.mfma.f32.16x16x16bf16.1k %arg11, %arg11, %arg5, %arg3, %arg3, %arg3 :
		(vector<4xi16>, vector<4xi16>, vector<4xf32>,
		i32, i32, i32) -> vector<4xf32>

		// CHECK: rocdl.mfma.f64.16x16x4f64 {{.*}} : (f64, f64, vector<4xf64>, i32, i32, i32) -> vector<4xf64>
		%r25 = rocdl.mfma.f64.16x16x4f64 %arg13, %arg13, %arg12, %arg3, %arg3, %arg3 :
		(f64, f64, vector<4xf64>,
		i32, i32, i32) -> vector<4xf64>

		// CHECK: rocdl.mfma.f64.4x4x4f64 {{.*}} : (f64, f64, f64, i32, i32, i32) -> f64
		%r26 = rocdl.mfma.f64.4x4x4f64 %arg13, %arg13, %arg13, %arg3, %arg3, %arg3 :
		(f64, f64, f64,
		i32, i32, i32) -> f64

		// CHECK: rocdl.mfma.i32.16x16x32.i8 {{.*}} : (i64, i64, vector<4xi32>, i32, i32, i32) -> vector<4xi32>
		%r27 = rocdl.mfma.i32.16x16x32.i8 %arg14, %arg14, %arg9, %arg3, %arg3, %arg3 :
		(i64, i64, vector<4xi32>,
		i32, i32, i32) -> vector<4xi32>

		// CHECK: rocdl.mfma.i32.32x32x16.i8 {{.*}} : (i64, i64, vector<16xi32>, i32, i32, i32) -> vector<16xi32>
		%r28 = rocdl.mfma.i32.32x32x16.i8 %arg14, %arg14, %arg8, %arg3, %arg3, %arg3 :
		(i64, i64, vector<16xi32>,
		i32, i32, i32) -> vector<16xi32>

		// CHECK: rocdl.mfma.f32.16x16x8.xf32 {{.*}} : (vector<2xf32>, vector<2xf32>, vector<4xf32>, i32, i32, i32) -> vector<4xf32>
		%r29 = rocdl.mfma.f32.16x16x8.xf32 %arg15, %arg15, %arg5, %arg3, %arg3, %arg3 :
		(vector<2xf32>, vector<2xf32>, vector<4xf32>,
		i32, i32, i32) -> vector<4xf32>

		// CHECK: rocdl.mfma.f32.32x32x4.xf32 {{.*}} : (vector<2xf32>, vector<2xf32>, vector<16xf32>, i32, i32, i32) -> vector<16xf32>
		%r30 = rocdl.mfma.f32.32x32x4.xf32 %arg15, %arg15, %arg4, %arg3, %arg3, %arg3 :
		(vector<2xf32>, vector<2xf32>, vector<16xf32>,
		i32, i32, i32) -> vector<16xf32>

		llvm.return
}		}

llvm.func @rocdl.mubuf(%rsrc : vector<4xi32>, %vindex : i32,		llvm.func @rocdl.mubuf(%rsrc : vector<4xi32>, %vindex : i32,
%offset : i32, %glc : i1,		%offset : i32, %glc : i1,
%slc : i1, %vdata1 : vector<1xf32>,		%slc : i1, %vdata1 : vector<1xf32>,
%vdata2 : vector<2xf32>, %vdata4 : vector<4xf32>) {		%vdata2 : vector<2xf32>, %vdata4 : vector<4xf32>) {
// CHECK-LABEL: rocdl.mubuf		// CHECK-LABEL: rocdl.mubuf
// CHECK: %{{.}} = rocdl.buffer.load %{{.}} %{{.}} %{{.}} %{{.}} %{{.}} : vector<1xf32>		// CHECK: %{{.}} = rocdl.buffer.load %{{.}} %{{.}} %{{.}} %{{.}} %{{.}} : vector<1xf32>
[48 lines not shown]

Revision Contents

Diff 434915

mlir/include/mlir/Dialect/AMDGPU/AMDGPU.td

mlir/include/mlir/Dialect/AMDGPU/AMDGPUDialect.h

mlir/include/mlir/Dialect/AMDGPU/CMakeLists.txt

mlir/include/mlir/Dialect/LLVMIR/ROCDLOps.td

mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp

mlir/lib/Dialect/AMDGPU/IR/AMDGPUDialect.cpp

mlir/lib/Dialect/AMDGPU/IR/CMakeLists.txt

mlir/test/Conversion/AMDGPUToROCDL/amdgpu-to-rocdl.mlir

mlir/test/Dialect/AMDGPU/ops.mlir

mlir/test/Dialect/LLVMIR/rocdl.mlir
