This is an archive of the discontinued LLVM Phabricator instance.

[mlir][AMDGPU] 8-bit float usage in the AMDGPU dialect
Closed, Public

Authored by krzysz00 on Feb 13 2023, 3:18 PM.

Details

Reviewers

ftynse
nicolasvasilache
herhut
dcaballe
nirvedhmeshram
ThomasRaoux
whchung
jakeh-gc

Commits

rG22f0c7a45149: [mlir][AMDGPU] 8-bit float usage in the AMDGPU dialect

Summary

Upcoming AMD hardware will include functions that accept 8-bit floats.
Specifically, there are MFMA instructions that accept 8-bit floats,
either using the same or mixed formats. This patch adds MLIR wrappers
for these intrinsics and explicitly adds support for 8-bit floats in
the gpu-to-rocdl conversion by way of amdgpu-to-rocdl.
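
As a rough illustration of how the same-format and mixed-format cases map onto intrinsic names, here is a minimal standalone sketch. The enum `F8Kind` and the function `selectF8Mfma` are hypothetical names used only for this example and do not exist in MLIR; the mnemonics are the ones this patch adds to ROCDLOps.td. The chipset check performed by the real lowering is left out.

```cpp
#include <optional>
#include <string_view>

// Hypothetical stand-in for the two 8-bit float element types in this patch.
enum class F8Kind {
  BF8, // f8E5M2FNUZ
  FP8, // f8E4M3FNUZ
};

// Returns the ROCDL mnemonic for an f8 MFMA with the given source formats and
// tile shape, or std::nullopt if the patch lists no intrinsic for that shape.
constexpr std::optional<std::string_view>
selectF8Mfma(F8Kind a, F8Kind b, int m, int n, int k, int blocks) {
  if (blocks != 1)
    return std::nullopt;
  const bool aBf8 = a == F8Kind::BF8;
  const bool bBf8 = b == F8Kind::BF8;
  if (m == 16 && n == 16 && k == 32) {
    if (aBf8)
      return std::string_view(bBf8 ? "mfma.f32.16x16x32.bf8.bf8"
                                   : "mfma.f32.16x16x32.bf8.fp8");
    return std::string_view(bBf8 ? "mfma.f32.16x16x32.fp8.bf8"
                                 : "mfma.f32.16x16x32.fp8.fp8");
  }
  if (m == 32 && n == 32 && k == 16) {
    if (aBf8)
      return std::string_view(bBf8 ? "mfma.f32.32x32x16.bf8.bf8"
                                   : "mfma.f32.32x32x16.bf8.fp8");
    return std::string_view(bBf8 ? "mfma.f32.32x32x16.fp8.bf8"
                                 : "mfma.f32.32x32x16.fp8.fp8");
  }
  return std::nullopt;
}

// Compile-time usage checks: a mixed-format case and an unsupported shape.
static_assert(selectF8Mfma(F8Kind::BF8, F8Kind::FP8, 16, 16, 32, 1) ==
              std::string_view("mfma.f32.16x16x32.bf8.fp8"));
static_assert(!selectF8Mfma(F8Kind::FP8, F8Kind::FP8, 32, 32, 8, 1).has_value());
```

The format of the first source operand picks the first suffix and the format of the second picks the second, which is why the two source operand types can no longer be required to match.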

Since LLVM does not have f8 types, when targeting LLVM for compilation
on an AMD GPU, both f8 types used on AMD hardware (f8E5M2FNUZ and
f8E4M3FNUZ) are rewritten to i8.
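
The rewrite rule can be pictured with the following standalone sketch. `Elem` and `convertF8ToI8` are hypothetical names for this example only and are a much-reduced model, not MLIR's `Type` or `LLVMTypeConverter`. The point is the shape of the callback: f8 maps to i8, and every other type returns an empty optional so that other conversion rules can handle it.

```cpp
#include <optional>

// Hypothetical, much-reduced model of element types; not MLIR's Type class.
enum class Elem { F8E5M2FNUZ, F8E4M3FNUZ, F16, F32, I8, I32 };

// Maps both AMD f8 formats to i8. Any other type yields std::nullopt, which
// in this model means "not handled here, try the next conversion rule".
constexpr std::optional<Elem> convertF8ToI8(Elem type) {
  if (type == Elem::F8E5M2FNUZ || type == Elem::F8E4M3FNUZ)
    return Elem::I8;
  return std::nullopt;
}

// Compile-time usage checks.
static_assert(convertF8ToI8(Elem::F8E4M3FNUZ) == Elem::I8);
static_assert(!convertF8ToI8(Elem::F32).has_value());
```

Because only the type is rewritten, the 8 bits of each value are carried through unchanged; the f8 interpretation is applied by the hardware instruction that consumes them.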

This patch also relaxes the restriction that the types of both source
operands to an amdgpu.mfma instruction match exactly, as this is not
necessarily required for the bf8 (f8E5M2FNUZ) and fp8 (f8E4M3FNUZ)
instructions. In addition, since the buffer_{load,store} operations
maintain a whitelist of permitted types, we add the relevant f8 types
to that list.
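
The relaxed source-operand rule can be sketched outside of MLIR as follows. `Elem`, `OperandType`, `isF8`, and `checkSources` are hypothetical names for this example; an operand is modeled as an element kind plus a length, with length 1 standing in for a scalar, so the distinction between a scalar and a one-element vector is not captured. The error strings are the ones the patch adds to the verifier.

```cpp
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical, much-reduced model of element types; not MLIR's Type class.
enum class Elem { F8E5M2FNUZ, F8E4M3FNUZ, F16, BF16, F32, F64, I8, I32, I64 };

// Hypothetical operand model: element kind and element count (1 = scalar).
struct OperandType {
  Elem elem;
  std::int64_t len;
};

constexpr bool isF8(Elem e) {
  return e == Elem::F8E5M2FNUZ || e == Elem::F8E4M3FNUZ;
}

// Returns an error message, or std::nullopt when the pair is accepted.
constexpr std::optional<std::string_view> checkSources(OperandType a,
                                                       OperandType b) {
  if (isF8(a.elem)) {
    // f8 sources may mix formats but must agree on length.
    if (!isF8(b.elem))
      return std::string_view(
          "expected both source operands to have f8 elements");
    if (a.len != b.len)
      return std::string_view(
          "expected both f8 source vectors to have the same length");
    return std::nullopt;
  }
  // All other sources must still match exactly.
  if (a.elem != b.elem || a.len != b.len)
    return std::string_view(
        "expected both non-f8 source operand types to match exactly");
  return std::nullopt;
}

// Compile-time usage checks: mixed f8 formats pass, mixed 16-bit floats fail.
static_assert(!checkSources({Elem::F8E5M2FNUZ, 8}, {Elem::F8E4M3FNUZ, 8}));
static_assert(checkSources({Elem::F16, 4}, {Elem::BF16, 4}).has_value());
```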

This patch does not add any implementations of arithmetic operations
for f8 types.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

krzysz00 created this revision. Feb 13 2023, 3:18 PM

Herald added a reviewer: ftynse. Feb 13 2023, 3:18 PM

Herald added a project: Restricted Project.

Herald added subscribers: Moerafaat, zero9178, bzcheeseman and 30 others.

krzysz00 requested review of this revision. Feb 13 2023, 3:18 PM

Herald added a reviewer: nicolasvasilache. Feb 13 2023, 3:18 PM

Herald added a reviewer: herhut.

Herald added a reviewer: dcaballe.

Herald added a project: Restricted Project.

Herald added subscribers: stephenneuendorffer, nicolasvasilache, wdng.

Harbormaster completed remote builds in B213523: Diff 497130. Feb 13 2023, 3:38 PM

jakeh-gc accepted this revision. Feb 15 2023, 3:28 AM

jakeh-gc added a subscriber: jakeh-gc.

jakeh-gc added inline comments.

mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp
Line 411: typo: varifier -> verifier.

This revision is now accepted and ready to land. Feb 15 2023, 3:28 AM

Fix typo

krzysz00 marked an inline comment as done. Feb 15 2023, 8:45 AM

This revision was landed with ongoing or failed builds. Feb 15 2023, 8:46 AM

Closed by commit rG22f0c7a45149: [mlir][AMDGPU] 8-bit float usage in the AMDGPU dialect (authored by krzysz00).

This revision was automatically updated to reflect the committed changes.

krzysz00 added a commit: rG22f0c7a45149: [mlir][AMDGPU] 8-bit float usage in the AMDGPU dialect.

Harbormaster completed remote builds in B213912: Diff 497695. Feb 15 2023, 9:54 AM

Revision Contents

Path	Size
mlir/include/mlir/Dialect/AMDGPU/AMDGPU.td	18 lines
mlir/include/mlir/Dialect/LLVMIR/ROCDLOps.td	9 lines
mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp	47 lines
mlir/lib/Dialect/AMDGPU/IR/AMDGPUDialect.cpp	18 lines
mlir/test/Conversion/AMDGPUToROCDL/amdgpu-to-rocdl.mlir	24 lines
mlir/test/Conversion/AMDGPUToROCDL/mfma.mlir	82 lines
mlir/test/Dialect/AMDGPU/invalid.mlir	38 lines
mlir/test/Dialect/AMDGPU/ops.mlir	2 lines
mlir/test/Target/LLVMIR/rocdl.mlir	41 lines

Diff 497696

mlir/include/mlir/Dialect/AMDGPU/AMDGPU.td

[41 lines hidden]
def AMDGPU_RawBufferLoadOp :		def AMDGPU_RawBufferLoadOp :
AMDGPU_Op<"raw_buffer_load", [AllElementTypesMatch<["value", "memref"]>,		AMDGPU_Op<"raw_buffer_load", [AllElementTypesMatch<["value", "memref"]>,
AttrSizedOperandSegments]>,		AttrSizedOperandSegments]>,
Arguments<(ins Arg<AnyMemRef, "buffer to load from", [MemRead]>:$memref,		Arguments<(ins Arg<AnyMemRef, "buffer to load from", [MemRead]>:$memref,
Variadic<I32>:$indices,		Variadic<I32>:$indices,
DefaultValuedAttr<BoolAttr, "true">:$boundsCheck,		DefaultValuedAttr<BoolAttr, "true">:$boundsCheck,
OptionalAttr<I32Attr>:$indexOffset,		OptionalAttr<I32Attr>:$indexOffset,
Optional<I32>:$sgprOffset)>,		Optional<I32>:$sgprOffset)>,
Results<(outs AnyTypeOf<[BF16, F16, F32, I32, I8,		Results<(outs AnyTypeOf<[BF16, F16, F32, I32, I8, F8E5M2FNUZ, F8E4M3FNUZ,
VectorOfLengthAndType<[2, 4], [F32, I32]>,		VectorOfLengthAndType<[2, 4], [F32, I32]>,
VectorOfLengthAndType<[2, 4, 8], [F16, BF16]>,		VectorOfLengthAndType<[2, 4, 8], [F16, BF16]>,
VectorOfLengthAndType<[2, 4, 8, 16], [I8]>]>:$value)> {		VectorOfLengthAndType<[2, 4, 8, 16],
		[I8, F8E5M2FNUZ, F8E4M3FNUZ]>]>:$value)> {

let summary = "Raw Buffer load, exposing GCN features";		let summary = "Raw Buffer load, exposing GCN features";
let description = [{		let description = [{
The `amdgpu.raw_buffer_load` op is a wrapper around the buffer load intrinsics		The `amdgpu.raw_buffer_load` op is a wrapper around the buffer load intrinsics
available on AMD GPUs, including extensions in newer GPUs.		available on AMD GPUs, including extensions in newer GPUs.

The index into the buffer is computed as for `memref.load` with the additon		The index into the buffer is computed as for `memref.load` with the additon
of `indexOffset` and `sgprOffset` (which may or may not be considered		of `indexOffset` and `sgprOffset` (which may or may not be considered
[29 lines hidden; context: def AMDGPU_RawBufferLoadOp :]
let hasCanonicalizer = 1;		let hasCanonicalizer = 1;
let hasVerifier = 1;		let hasVerifier = 1;
}		}

/// Raw buffer store		/// Raw buffer store
def AMDGPU_RawBufferStoreOp :		def AMDGPU_RawBufferStoreOp :
AMDGPU_Op<"raw_buffer_store", [AllElementTypesMatch<["value", "memref"]>,		AMDGPU_Op<"raw_buffer_store", [AllElementTypesMatch<["value", "memref"]>,
AttrSizedOperandSegments]>,		AttrSizedOperandSegments]>,
Arguments<(ins AnyTypeOf<[BF16, F16, F32, I32, I8,		Arguments<(ins AnyTypeOf<[BF16, F16, F32, I32, I8, F8E5M2FNUZ, F8E4M3FNUZ,
VectorOfLengthAndType<[2, 4], [F32, I32]>,		VectorOfLengthAndType<[2, 4], [F32, I32]>,
VectorOfLengthAndType<[2, 4, 8], [F16, BF16]>,		VectorOfLengthAndType<[2, 4, 8], [F16, BF16]>,
VectorOfLengthAndType<[2, 4, 8, 16], [I8]>]>:$value,		VectorOfLengthAndType<[2, 4, 8, 16],
		[I8, F8E5M2FNUZ, F8E4M3FNUZ]>]>:$value,
Arg<AnyMemRef, "buffer to store to", [MemWrite]>:$memref,		Arg<AnyMemRef, "buffer to store to", [MemWrite]>:$memref,
Variadic<I32>:$indices,		Variadic<I32>:$indices,
DefaultValuedAttr<BoolAttr, "true">:$boundsCheck,		DefaultValuedAttr<BoolAttr, "true">:$boundsCheck,
OptionalAttr<I32Attr>:$indexOffset,		OptionalAttr<I32Attr>:$indexOffset,
Optional<I32>:$sgprOffset)> {		Optional<I32>:$sgprOffset)> {

let summary = "Raw Buffer Store, exposing GCN features";		let summary = "Raw Buffer Store, exposing GCN features";
let description = [{		let description = [{
[99 lines hidden]
def AMDGPU_MFMAPermBAttr : EnumAttr<AMDGPU_Dialect, AMDGPU_MFMAPermB,		def AMDGPU_MFMAPermBAttr : EnumAttr<AMDGPU_Dialect, AMDGPU_MFMAPermB,
"mfma_perm_b">;		"mfma_perm_b">;

// mfma		// mfma
def MFMAInTypes : AnyTypeOf<[F32, F64, I32, I64,		def MFMAInTypes : AnyTypeOf<[F32, F64, I32, I64,
VectorOfLengthAndType<[2], [F32]>,		VectorOfLengthAndType<[2], [F32]>,
VectorOfLengthAndType<[4], [F16]>,		VectorOfLengthAndType<[4], [F16]>,
VectorOfLengthAndType<[2, 4], [BF16]>,		VectorOfLengthAndType<[2, 4], [BF16]>,
VectorOfLengthAndType<[4, 8], [I8]>]>;		VectorOfLengthAndType<[4, 8], [I8]>,
		VectorOfLengthAndType<[8], [F8E5M2FNUZ, F8E4M3FNUZ]>]>;
def MFMAOutTypes : AnyTypeOf<[F64,		def MFMAOutTypes : AnyTypeOf<[F64,
VectorOfLengthAndType<[4, 16, 32], [F32]>,		VectorOfLengthAndType<[4, 16, 32], [F32]>,
VectorOfLengthAndType<[4, 16, 32], [I32]>,		VectorOfLengthAndType<[4, 16, 32], [I32]>,
VectorOfLengthAndType<[4], [F64]>]>;		VectorOfLengthAndType<[4], [F64]>]>;

def AMDGPU_MFMAOp :		def AMDGPU_MFMAOp :
AMDGPU_Op<"mfma", [AllTypesMatch<["sourceA", "sourceB"]>,		AMDGPU_Op<"mfma", [AllTypesMatch<["destC", "destD"]>,
AllTypesMatch<["destC", "destD"]>,
Pure]>,		Pure]>,
Arguments<(ins		Arguments<(ins
I32Attr:$m,		I32Attr:$m,
I32Attr:$n,		I32Attr:$n,
I32Attr:$k,		I32Attr:$k,
I32Attr:$blocks,		I32Attr:$blocks,
MFMAInTypes:$sourceA,		MFMAInTypes:$sourceA,
MFMAInTypes:$sourceB,		MFMAInTypes:$sourceB,
[34 lines hidden; context: let description = [{]

The negateA, negateB, and negateC flags are only supported for double-precision		The negateA, negateB, and negateC flags are only supported for double-precision
operations on gfx940+.		operations on gfx940+.
}];		}];
let assemblyFormat = [{		let assemblyFormat = [{
$sourceA `*` $sourceB `+` $destC		$sourceA `*` $sourceB `+` $destC
attr-dict		attr-dict
`blgp` `=` $blgp		`blgp` `=` $blgp
`:` type($sourceA) `,` type($destC)		`:` type($sourceA) `,` type($sourceB) `,` type($destC)
}];		}];
let hasVerifier = 1;		let hasVerifier = 1;
}		}

#endif // AMDGPU		#endif // AMDGPU

mlir/include/mlir/Dialect/LLVMIR/ROCDLOps.td

	[166 lines hidden]
	// NEG bitfield. See IntrinsicsAMDGPU.td for more info.			// NEG bitfield. See IntrinsicsAMDGPU.td for more info.
	def ROCDL_mfma_f64_16x16x4f64 : ROCDL_Mfma_IntrOp<"mfma.f64.16x16x4f64">;			def ROCDL_mfma_f64_16x16x4f64 : ROCDL_Mfma_IntrOp<"mfma.f64.16x16x4f64">;
	def ROCDL_mfma_f64_4x4x4f64 : ROCDL_Mfma_IntrOp<"mfma.f64.4x4x4f64">;			def ROCDL_mfma_f64_4x4x4f64 : ROCDL_Mfma_IntrOp<"mfma.f64.4x4x4f64">;
	// New in gfx940.			// New in gfx940.
	def ROCDL_mfma_i32_16x16x32_i8 : ROCDL_Mfma_IntrOp<"mfma.i32.16x16x32.i8">;			def ROCDL_mfma_i32_16x16x32_i8 : ROCDL_Mfma_IntrOp<"mfma.i32.16x16x32.i8">;
	def ROCDL_mfma_i32_32x32x16_i8 : ROCDL_Mfma_IntrOp<"mfma.i32.32x32x16.i8">;			def ROCDL_mfma_i32_32x32x16_i8 : ROCDL_Mfma_IntrOp<"mfma.i32.32x32x16.i8">;
	def ROCDL_mfma_f32_16x16x8_xf32 : ROCDL_Mfma_IntrOp<"mfma.f32.16x16x8.xf32">;			def ROCDL_mfma_f32_16x16x8_xf32 : ROCDL_Mfma_IntrOp<"mfma.f32.16x16x8.xf32">;
	def ROCDL_mfma_f32_32x32x4_xf32 : ROCDL_Mfma_IntrOp<"mfma.f32.32x32x4.xf32">;			def ROCDL_mfma_f32_32x32x4_xf32 : ROCDL_Mfma_IntrOp<"mfma.f32.32x32x4.xf32">;
				// fp8, only on gfx940
				def ROCDL_mfma_f32_16x16x32_bf8_bf8 : ROCDL_Mfma_IntrOp<"mfma.f32.16x16x32.bf8.bf8">;
				def ROCDL_mfma_f32_16x16x32_bf8_fp8 : ROCDL_Mfma_IntrOp<"mfma.f32.16x16x32.bf8.fp8">;
				def ROCDL_mfma_f32_16x16x32_fp8_bf8 : ROCDL_Mfma_IntrOp<"mfma.f32.16x16x32.fp8.bf8">;
				def ROCDL_mfma_f32_16x16x32_fp8_fp8 : ROCDL_Mfma_IntrOp<"mfma.f32.16x16x32.fp8.fp8">;
				def ROCDL_mfma_f32_32x32x16_bf8_bf8 : ROCDL_Mfma_IntrOp<"mfma.f32.32x32x16.bf8.bf8">;
				def ROCDL_mfma_f32_32x32x16_bf8_fp8 : ROCDL_Mfma_IntrOp<"mfma.f32.32x32x16.bf8.fp8">;
				def ROCDL_mfma_f32_32x32x16_fp8_bf8 : ROCDL_Mfma_IntrOp<"mfma.f32.32x32x16.fp8.bf8">;
				def ROCDL_mfma_f32_32x32x16_fp8_fp8 : ROCDL_Mfma_IntrOp<"mfma.f32.32x32x16.fp8.fp8">;

	//===---------------------------------------------------------------------===//			//===---------------------------------------------------------------------===//
	// Vector buffer load/store intrinsics			// Vector buffer load/store intrinsics

	def ROCDL_MubufLoadOp :			def ROCDL_MubufLoadOp :
	ROCDL_Op<"buffer.load">,			ROCDL_Op<"buffer.load">,
	Results<(outs LLVM_Type:$res)>,			Results<(outs LLVM_Type:$res)>,
	Arguments<(ins LLVM_Type:$rsrc,			Arguments<(ins LLVM_Type:$rsrc,
	[83 lines hidden]

mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp

[398 lines hidden; context: static std::optional<StringRef> mfmaOpToIntrinsic(MFMAOp mfma,]
}		}

if (sourceElem.isF64() && destElem.isF64() && chipset.minorVersion >= 0x0a) {		if (sourceElem.isF64() && destElem.isF64() && chipset.minorVersion >= 0x0a) {
if (m == 16 && n == 16 && k == 4 && b == 1)		if (m == 16 && n == 16 && k == 4 && b == 1)
return ROCDL::mfma_f64_16x16x4f64::getOperationName();		return ROCDL::mfma_f64_16x16x4f64::getOperationName();
if (m == 4 && n == 4 && k == 4 && b == 4)		if (m == 4 && n == 4 && k == 4 && b == 4)
return ROCDL::mfma_f64_4x4x4f64::getOperationName();		return ROCDL::mfma_f64_4x4x4f64::getOperationName();
}		}

		if (sourceElem.isFloat8E5M2FNUZ() && destElem.isF32() &&
		chipset.minorVersion >= 0x40) {
		// Known to be correct because there are no scalar f8 instructions and
		// because a length mismatch will have been caught by the verifier.
		Type sourceBElem =
		mfma.getSourceB().getType().cast<VectorType>().getElementType();
		if (m == 16 && n == 16 && k == 32 && b == 1) {
		if (sourceBElem.isFloat8E5M2FNUZ())
		return ROCDL::mfma_f32_16x16x32_bf8_bf8::getOperationName();
		if (sourceBElem.isFloat8E4M3FNUZ())
		return ROCDL::mfma_f32_16x16x32_bf8_fp8::getOperationName();
		}
		if (m == 32 && n == 32 && k == 16 && b == 1) {
		if (sourceBElem.isFloat8E5M2FNUZ())
		return ROCDL::mfma_f32_32x32x16_bf8_bf8::getOperationName();
		if (sourceBElem.isFloat8E4M3FNUZ())
		return ROCDL::mfma_f32_32x32x16_bf8_fp8::getOperationName();
		}
		}

		if (sourceElem.isFloat8E4M3FNUZ() && destElem.isF32() &&
		chipset.minorVersion >= 0x40) {
		Type sourceBElem =
		mfma.getSourceB().getType().cast<VectorType>().getElementType();
		if (m == 16 && n == 16 && k == 32 && b == 1) {
		if (sourceBElem.isFloat8E5M2FNUZ())
		return ROCDL::mfma_f32_16x16x32_fp8_bf8::getOperationName();
		if (sourceBElem.isFloat8E4M3FNUZ())
		return ROCDL::mfma_f32_16x16x32_fp8_fp8::getOperationName();
		}
		if (m == 32 && n == 32 && k == 16 && b == 1) {
		if (sourceBElem.isFloat8E5M2FNUZ())
		return ROCDL::mfma_f32_32x32x16_fp8_bf8::getOperationName();
		if (sourceBElem.isFloat8E4M3FNUZ())
		return ROCDL::mfma_f32_32x32x16_fp8_fp8::getOperationName();
		}
		}

return std::nullopt;		return std::nullopt;
}		}

namespace {		namespace {
struct MFMAOpLowering : public ConvertOpToLLVMPattern<MFMAOp> {		struct MFMAOpLowering : public ConvertOpToLLVMPattern<MFMAOp> {
MFMAOpLowering(LLVMTypeConverter &converter, Chipset chipset)		MFMAOpLowering(LLVMTypeConverter &converter, Chipset chipset)
: ConvertOpToLLVMPattern<MFMAOp>(converter), chipset(chipset) {}		: ConvertOpToLLVMPattern<MFMAOp>(converter), chipset(chipset) {}

[55 lines hidden; context: if (failed(applyPartialConversion(getOperation(), target,]
signalPassFailure();		signalPassFailure();
}		}
};		};
} // namespace		} // namespace

void mlir::populateAMDGPUToROCDLConversionPatterns(LLVMTypeConverter &converter,		void mlir::populateAMDGPUToROCDLConversionPatterns(LLVMTypeConverter &converter,
RewritePatternSet &patterns,		RewritePatternSet &patterns,
Chipset chipset) {		Chipset chipset) {
		// ROCDL supports fp8 types in some contexts, but there is no LLVM-level f8
		// type. Therefore, for this target, declare f8 to be equal to i8.
		converter.addConversion([](FloatType type) -> std::optional<Type> {
		if (type.isFloat8E5M2FNUZ() || type.isFloat8E4M3FNUZ())
		return IntegerType::get(type.getContext(), 8);
		return std::nullopt;
		});

patterns.add<LDSBarrierOpLowering>(converter);		patterns.add<LDSBarrierOpLowering>(converter);
patterns.add<		patterns.add<
RawBufferOpLowering<RawBufferLoadOp, ROCDL::RawBufferLoadOp>,		RawBufferOpLowering<RawBufferLoadOp, ROCDL::RawBufferLoadOp>,
RawBufferOpLowering<RawBufferStoreOp, ROCDL::RawBufferStoreOp>,		RawBufferOpLowering<RawBufferStoreOp, ROCDL::RawBufferStoreOp>,
RawBufferOpLowering<RawBufferAtomicFaddOp, ROCDL::RawBufferAtomicFAddOp>,		RawBufferOpLowering<RawBufferAtomicFaddOp, ROCDL::RawBufferAtomicFAddOp>,
MFMAOpLowering>(converter, chipset);		MFMAOpLowering>(converter, chipset);
}		}

std::unique_ptr<Pass> mlir::createConvertAMDGPUToROCDLPass() {		std::unique_ptr<Pass> mlir::createConvertAMDGPUToROCDLPass() {
return std::make_unique<ConvertAMDGPUToROCDLPass>();		return std::make_unique<ConvertAMDGPUToROCDLPass>();
}		}

mlir/lib/Dialect/AMDGPU/IR/AMDGPUDialect.cpp

[183 lines hidden; context: if (auto sourceVector = sourceType.dyn_cast<VectorType>()) {]
sourceLen = sourceVector.getNumElements();		sourceLen = sourceVector.getNumElements();
sourceElem = sourceVector.getElementType();		sourceElem = sourceVector.getElementType();
}		}
if (auto destVector = destType.dyn_cast<VectorType>()) {		if (auto destVector = destType.dyn_cast<VectorType>()) {
destLen = destVector.getNumElements();		destLen = destVector.getNumElements();
destElem = destVector.getElementType();		destElem = destVector.getElementType();
}		}

		Type sourceBType = getSourceB().getType();
		if (sourceElem.isFloat8E5M2FNUZ() || sourceElem.isFloat8E4M3FNUZ()) {
		int64_t sourceBLen = 1;
		Type sourceBElem = sourceBType;
		if (auto sourceBVector = sourceBType.dyn_cast<VectorType>()) {
		sourceBLen = sourceBVector.getNumElements();
		sourceBElem = sourceBVector.getElementType();
		}
		if (!sourceBElem.isFloat8E5M2FNUZ() && !sourceBElem.isFloat8E4M3FNUZ())
		return emitOpError("expected both source operands to have f8 elements");
		if (sourceLen != sourceBLen)
		return emitOpError(
		"expected both f8 source vectors to have the same length");
		} else {
		if (sourceType != sourceBType)
		return emitOpError(
		"expected both non-f8 source operand types to match exactly");
		}
// Normalize the wider integer types the compiler expects to i8		// Normalize the wider integer types the compiler expects to i8
if (sourceElem.isInteger(32)) {		if (sourceElem.isInteger(32)) {
sourceLen *= 4;		sourceLen *= 4;
sourceElem = b.getI8Type();		sourceElem = b.getI8Type();
}		}
if (sourceElem.isInteger(64)) {		if (sourceElem.isInteger(64)) {
sourceLen *= 8;		sourceLen *= 8;
sourceElem = b.getI8Type();		sourceElem = b.getI8Type();
[38 lines hidden]

mlir/test/Conversion/AMDGPUToROCDL/amdgpu-to-rocdl.mlir

	[43 lines hidden]
	func.func @gpu_gcn_raw_buffer_load_i8(%buf: memref<64xi8>, %idx: i32) -> i8 {			func.func @gpu_gcn_raw_buffer_load_i8(%buf: memref<64xi8>, %idx: i32) -> i8 {
	// CHECK: %[[numRecords:.*]] = llvm.mlir.constant(64 : i32)			// CHECK: %[[numRecords:.*]] = llvm.mlir.constant(64 : i32)
	// CHECK: llvm.insertelement{{.*}}%[[numRecords]]			// CHECK: llvm.insertelement{{.*}}%[[numRecords]]
	// CHECK: %[[ret:.*]] = rocdl.raw.buffer.load %{{.*}}, %{{.*}}, %{{.*}}, %{{.*}} : i8			// CHECK: %[[ret:.*]] = rocdl.raw.buffer.load %{{.*}}, %{{.*}}, %{{.*}}, %{{.*}} : i8
	// CHECK: return %[[ret]]			// CHECK: return %[[ret]]
	%0 = amdgpu.raw_buffer_load {boundsCheck = true} %buf[%idx] : memref<64xi8>, i32 -> i8			%0 = amdgpu.raw_buffer_load {boundsCheck = true} %buf[%idx] : memref<64xi8>, i32 -> i8
	func.return %0 : i8			func.return %0 : i8
	}			}

	// CHECK-LABEL: func @gpu_gcn_raw_buffer_load_2xi8			// CHECK-LABEL: func @gpu_gcn_raw_buffer_load_2xi8
	func.func @gpu_gcn_raw_buffer_load_2xi8(%buf: memref<64xi8>, %idx: i32) -> vector<2xi8> {			func.func @gpu_gcn_raw_buffer_load_2xi8(%buf: memref<64xi8>, %idx: i32) -> vector<2xi8> {
	// CHECK: %[[numRecords:.*]] = llvm.mlir.constant(64 : i32)			// CHECK: %[[numRecords:.*]] = llvm.mlir.constant(64 : i32)
	// CHECK: llvm.insertelement{{.*}}%[[numRecords]]			// CHECK: llvm.insertelement{{.*}}%[[numRecords]]
	// CHECK: %[[loaded:.*]] = rocdl.raw.buffer.load %{{.*}}, %{{.*}}, %{{.*}}, %{{.*}} : i16			// CHECK: %[[loaded:.*]] = rocdl.raw.buffer.load %{{.*}}, %{{.*}}, %{{.*}}, %{{.*}} : i16
	// CHECK: %[[ret:.*]] = llvm.bitcast %[[loaded]] : i16 to vector<2xi8>			// CHECK: %[[ret:.*]] = llvm.bitcast %[[loaded]] : i16 to vector<2xi8>
	// CHECK: return %[[ret]]			// CHECK: return %[[ret]]
	%0 = amdgpu.raw_buffer_load {boundsCheck = true} %buf[%idx] : memref<64xi8>, i32 -> vector<2xi8>			%0 = amdgpu.raw_buffer_load {boundsCheck = true} %buf[%idx] : memref<64xi8>, i32 -> vector<2xi8>
	func.return %0 : vector<2xi8>			func.return %0 : vector<2xi8>
	}			}

	// CHECK-LABEL: func @gpu_gcn_raw_buffer_load_16xi8			// CHECK-LABEL: func @gpu_gcn_raw_buffer_load_16xi8
	func.func @gpu_gcn_raw_buffer_load_16xi8(%buf: memref<64xi8>, %idx: i32) -> vector<16xi8> {			func.func @gpu_gcn_raw_buffer_load_16xi8(%buf: memref<64xi8>, %idx: i32) -> vector<16xi8> {
	// CHECK: %[[loaded:.*]] = rocdl.raw.buffer.load %{{.*}}, %{{.*}}, %{{.*}}, %{{.*}} : vector<4xi32>			// CHECK: %[[loaded:.*]] = rocdl.raw.buffer.load %{{.*}}, %{{.*}}, %{{.*}}, %{{.*}} : vector<4xi32>
	// CHECK: %[[ret:.*]] = llvm.bitcast %[[loaded]] : vector<4xi32> to vector<16xi8>			// CHECK: %[[ret:.*]] = llvm.bitcast %[[loaded]] : vector<4xi32> to vector<16xi8>
	// CHECK: return %[[ret]]			// CHECK: return %[[ret]]
	%0 = amdgpu.raw_buffer_load {boundsCheck = true} %buf[%idx] : memref<64xi8>, i32 -> vector<16xi8>			%0 = amdgpu.raw_buffer_load {boundsCheck = true} %buf[%idx] : memref<64xi8>, i32 -> vector<16xi8>
	func.return %0 : vector<16xi8>			func.return %0 : vector<16xi8>
	}			}

				// CHECK-LABEL: func @gpu_gcn_raw_buffer_load_f8E5M2FNUZ
				func.func @gpu_gcn_raw_buffer_load_f8E5M2FNUZ(%buf: memref<64xf8E5M2FNUZ>, %idx: i32) -> f8E5M2FNUZ {
				// CHECK: %[[numRecords:.*]] = llvm.mlir.constant(64 : i32)
				// CHECK: llvm.insertelement{{.*}}%[[numRecords]]
				// CHECK: %[[loaded:.*]] = rocdl.raw.buffer.load %{{.*}}, %{{.*}}, %{{.*}}, %{{.*}} : i8
				// CHECK: %[[ret:.*]] = builtin.unrealized_conversion_cast %[[loaded]] : i8 to f8E5M2FNUZ
				// CHECK: return %[[ret]]
				%0 = amdgpu.raw_buffer_load {boundsCheck = true} %buf[%idx] : memref<64xf8E5M2FNUZ>, i32 -> f8E5M2FNUZ
				func.return %0 : f8E5M2FNUZ
				}

				// CHECK-LABEL: func @gpu_gcn_raw_buffer_load_4xf8E4M3FNUZ
				func.func @gpu_gcn_raw_buffer_load_4xf8E4M3FNUZ(%buf: memref<64xf8E4M3FNUZ>, %idx: i32) -> vector<4xf8E4M3FNUZ> {
				// CHECK: %[[numRecords:.*]] = llvm.mlir.constant(64 : i32)
				// CHECK: llvm.insertelement{{.*}}%[[numRecords]]
				// CHECK: %[[loaded:.*]] = rocdl.raw.buffer.load %{{.*}}, %{{.*}}, %{{.*}}, %{{.*}} : i32
				// CHECK: %[[cast:.*]] = llvm.bitcast %[[loaded]] : i32 to vector<4xi8>
				// CHECK: %[[ret:.*]] = builtin.unrealized_conversion_cast %[[cast]] : vector<4xi8> to vector<4xf8E4M3FNUZ>
				// CHECK: return %[[ret]]
				%0 = amdgpu.raw_buffer_load {boundsCheck = true} %buf[%idx] : memref<64xf8E4M3FNUZ>, i32 -> vector<4xf8E4M3FNUZ>
				func.return %0 : vector<4xf8E4M3FNUZ>
				}

	// Since the lowering logic is shared with loads, only bitcasts need to be rechecked			// Since the lowering logic is shared with loads, only bitcasts need to be rechecked
	// CHECK-LABEL: func @gpu_gcn_raw_buffer_store_i32			// CHECK-LABEL: func @gpu_gcn_raw_buffer_store_i32
	func.func @gpu_gcn_raw_buffer_store_i32(%value: i32, %buf: memref<64xi32>, %idx: i32) {			func.func @gpu_gcn_raw_buffer_store_i32(%value: i32, %buf: memref<64xi32>, %idx: i32) {
	// CHECK: %[[numRecords:.*]] = llvm.mlir.constant(256 : i32)			// CHECK: %[[numRecords:.*]] = llvm.mlir.constant(256 : i32)
	// CHECK: llvm.insertelement{{.*}}%[[numRecords]]			// CHECK: llvm.insertelement{{.*}}%[[numRecords]]
	// CHECK: %[[word3:.*]] = llvm.mlir.constant(159744 : i32)			// CHECK: %[[word3:.*]] = llvm.mlir.constant(159744 : i32)
	// CHECK: %[[resource:.*]] = llvm.insertelement{{.*}}%[[word3]]			// CHECK: %[[resource:.*]] = llvm.insertelement{{.*}}%[[word3]]
	// CHECK: rocdl.raw.buffer.store %{{.*}} %[[resource]], %{{.*}}, %{{.*}}, %{{.*}} : i32			// CHECK: rocdl.raw.buffer.store %{{.*}} %[[resource]], %{{.*}}, %{{.*}}, %{{.*}} : i32
	[38 lines hidden]

mlir/test/Conversion/AMDGPUToROCDL/mfma.mlir

	// RUN: mlir-opt %s -convert-amdgpu-to-rocdl=chipset=gfx940 | FileCheck %s			// RUN: mlir-opt %s -convert-amdgpu-to-rocdl=chipset=gfx940 | FileCheck %s
	func.func @mfma_to_rocdl(%arg0 : f32, %arg1 : vector<32xf32>,			func.func @mfma_to_rocdl(%arg0 : f32, %arg1 : vector<32xf32>,
	%arg2 : vector<16xf32>, %arg3 : vector<4xf32>,			%arg2 : vector<16xf32>, %arg3 : vector<4xf32>,
	%arg4 : vector<4xf16>, %arg5 : vector<4xi8>,			%arg4 : vector<4xf16>, %arg5 : vector<4xi8>,
	%arg6 : vector<32xi32>, %arg7 : vector<16xi32>,			%arg6 : vector<32xi32>, %arg7 : vector<16xi32>,
	%arg8 : vector<4xi32>, %arg9 : vector<2xbf16>,			%arg8 : vector<4xi32>, %arg9 : vector<2xbf16>,
	%arg10 : vector<4xbf16>, %arg11 : f64,			%arg10 : vector<4xbf16>, %arg11 : f64,
	%arg12 : vector<4xf64>, %arg13 : vector<8xi8>,			%arg12 : vector<4xf64>, %arg13 : vector<8xi8>,
	%arg14 : vector<2xf32>) {			%arg14 : vector<2xf32>, %arg15 : vector<8xf8E5M2FNUZ>,
				%arg16 : vector<8xf8E4M3FNUZ>) {
	// CHECK: rocdl.mfma.f32.32x32x1f32{{.*}}: (f32, f32, vector<32xf32>, i32, i32, i32) -> vector<32xf32>			// CHECK: rocdl.mfma.f32.32x32x1f32{{.*}}: (f32, f32, vector<32xf32>, i32, i32, i32) -> vector<32xf32>
	amdgpu.mfma %arg0 * %arg0 + %arg1 { abid = 0 : i32, cbsz = 0 : i32, k = 1 : i32, m = 32 : i32, n = 32 : i32, blocks = 2 : i32 } blgp = none : f32, vector<32xf32>			amdgpu.mfma %arg0 * %arg0 + %arg1 { abid = 0 : i32, cbsz = 0 : i32, k = 1 : i32, m = 32 : i32, n = 32 : i32, blocks = 2 : i32 } blgp = none : f32, f32, vector<32xf32>
	// CHECK: rocdl.mfma.f32.16x16x1f32{{.*}}: (f32, f32, vector<16xf32>, i32, i32, i32) -> vector<16xf32>			// CHECK: rocdl.mfma.f32.16x16x1f32{{.*}}: (f32, f32, vector<16xf32>, i32, i32, i32) -> vector<16xf32>
	amdgpu.mfma %arg0 * %arg0 + %arg2 { abid = 0 : i32, cbsz = 0 : i32, k = 1 : i32, m = 16 : i32, n = 16 : i32, blocks = 4 : i32 } blgp = none : f32, vector<16xf32>			amdgpu.mfma %arg0 * %arg0 + %arg2 { abid = 0 : i32, cbsz = 0 : i32, k = 1 : i32, m = 16 : i32, n = 16 : i32, blocks = 4 : i32 } blgp = none : f32, f32, vector<16xf32>
	// CHECK: rocdl.mfma.f32.4x4x1f32{{.*}}: (f32, f32, vector<4xf32>, i32, i32, i32) -> vector<4xf32>			// CHECK: rocdl.mfma.f32.4x4x1f32{{.*}}: (f32, f32, vector<4xf32>, i32, i32, i32) -> vector<4xf32>
	amdgpu.mfma %arg0 * %arg0 + %arg3 { abid = 0 : i32, cbsz = 0 : i32, k = 1 : i32, m = 4 : i32, n = 4 : i32, blocks = 16 : i32 } blgp = none : f32, vector<4xf32>			amdgpu.mfma %arg0 * %arg0 + %arg3 { abid = 0 : i32, cbsz = 0 : i32, k = 1 : i32, m = 4 : i32, n = 4 : i32, blocks = 16 : i32 } blgp = none : f32, f32, vector<4xf32>
	// CHECK: rocdl.mfma.f32.32x32x2f32{{.*}}: (f32, f32, vector<16xf32>, i32, i32, i32) -> vector<16xf32>			// CHECK: rocdl.mfma.f32.32x32x2f32{{.*}}: (f32, f32, vector<16xf32>, i32, i32, i32) -> vector<16xf32>
	amdgpu.mfma %arg0 * %arg0 + %arg2 { abid = 0 : i32, cbsz = 0 : i32, k = 2 : i32, m = 32 : i32, n = 32 : i32, blocks = 1 : i32 } blgp = none : f32, vector<16xf32>			amdgpu.mfma %arg0 * %arg0 + %arg2 { abid = 0 : i32, cbsz = 0 : i32, k = 2 : i32, m = 32 : i32, n = 32 : i32, blocks = 1 : i32 } blgp = none : f32, f32, vector<16xf32>
	// CHECK: rocdl.mfma.f32.16x16x4f32{{.*}}: (f32, f32, vector<4xf32>, i32, i32, i32) -> vector<4xf32>			// CHECK: rocdl.mfma.f32.16x16x4f32{{.*}}: (f32, f32, vector<4xf32>, i32, i32, i32) -> vector<4xf32>
	amdgpu.mfma %arg0 * %arg0 + %arg3 { abid = 0 : i32, cbsz = 0 : i32, k = 4 : i32, m = 16 : i32, n = 16 : i32, blocks = 1 : i32 } blgp = none : f32, vector<4xf32>			amdgpu.mfma %arg0 * %arg0 + %arg3 { abid = 0 : i32, cbsz = 0 : i32, k = 4 : i32, m = 16 : i32, n = 16 : i32, blocks = 1 : i32 } blgp = none : f32, f32, vector<4xf32>
	// CHECK: rocdl.mfma.f32.32x32x4f16{{.*}}: (vector<4xf16>, vector<4xf16>, vector<32xf32>, i32, i32, i32) -> vector<32xf32>			// CHECK: rocdl.mfma.f32.32x32x4f16{{.*}}: (vector<4xf16>, vector<4xf16>, vector<32xf32>, i32, i32, i32) -> vector<32xf32>
	amdgpu.mfma %arg4 * %arg4 + %arg1 { abid = 0 : i32, cbsz = 0 : i32, k = 4 : i32, m = 32 : i32, n = 32 : i32, blocks = 2 : i32 } blgp = none : vector<4xf16>, vector<32xf32>			amdgpu.mfma %arg4 * %arg4 + %arg1 { abid = 0 : i32, cbsz = 0 : i32, k = 4 : i32, m = 32 : i32, n = 32 : i32, blocks = 2 : i32 } blgp = none : vector<4xf16>, vector<4xf16>, vector<32xf32>
	// CHECK: rocdl.mfma.f32.16x16x4f16{{.*}}: (vector<4xf16>, vector<4xf16>, vector<16xf32>, i32, i32, i32) -> vector<16xf32>			// CHECK: rocdl.mfma.f32.16x16x4f16{{.*}}: (vector<4xf16>, vector<4xf16>, vector<16xf32>, i32, i32, i32) -> vector<16xf32>
	amdgpu.mfma %arg4 * %arg4 + %arg2 { abid = 0 : i32, cbsz = 0 : i32, k = 4 : i32, m = 16 : i32, n = 16 : i32, blocks = 4 : i32 } blgp = none : vector<4xf16>, vector<16xf32>			amdgpu.mfma %arg4 * %arg4 + %arg2 { abid = 0 : i32, cbsz = 0 : i32, k = 4 : i32, m = 16 : i32, n = 16 : i32, blocks = 4 : i32 } blgp = none : vector<4xf16>, vector<4xf16>, vector<16xf32>
	// CHECK: rocdl.mfma.f32.4x4x4f16{{.*}}: (vector<4xf16>, vector<4xf16>, vector<4xf32>, i32, i32, i32) -> vector<4xf32>			// CHECK: rocdl.mfma.f32.4x4x4f16{{.*}}: (vector<4xf16>, vector<4xf16>, vector<4xf32>, i32, i32, i32) -> vector<4xf32>
	amdgpu.mfma %arg4 * %arg4 + %arg3 { abid = 0 : i32, cbsz = 0 : i32, k = 4 : i32, m = 4 : i32, n = 4 : i32, blocks = 16 : i32 } blgp = none : vector<4xf16>, vector<4xf32>			amdgpu.mfma %arg4 * %arg4 + %arg3 { abid = 0 : i32, cbsz = 0 : i32, k = 4 : i32, m = 4 : i32, n = 4 : i32, blocks = 16 : i32 } blgp = none : vector<4xf16>, vector<4xf16>, vector<4xf32>
	// CHECK: rocdl.mfma.f32.32x32x8f16{{.*}}: (vector<4xf16>, vector<4xf16>, vector<16xf32>, i32, i32, i32) -> vector<16xf32>			// CHECK: rocdl.mfma.f32.32x32x8f16{{.*}}: (vector<4xf16>, vector<4xf16>, vector<16xf32>, i32, i32, i32) -> vector<16xf32>
	amdgpu.mfma %arg4 * %arg4 + %arg2 { abid = 0 : i32, cbsz = 0 : i32, k = 8 : i32, m = 32 : i32, n = 32 : i32, blocks = 1 : i32 } blgp = none : vector<4xf16>, vector<16xf32>			amdgpu.mfma %arg4 * %arg4 + %arg2 { abid = 0 : i32, cbsz = 0 : i32, k = 8 : i32, m = 32 : i32, n = 32 : i32, blocks = 1 : i32 } blgp = none : vector<4xf16>, vector<4xf16>, vector<16xf32>
	// CHECK: rocdl.mfma.f32.16x16x16f16{{.*}}: (vector<4xf16>, vector<4xf16>, vector<4xf32>, i32, i32, i32) -> vector<4xf32>			// CHECK: rocdl.mfma.f32.16x16x16f16{{.*}}: (vector<4xf16>, vector<4xf16>, vector<4xf32>, i32, i32, i32) -> vector<4xf32>
	amdgpu.mfma %arg4 * %arg4 + %arg3 { abid = 0 : i32, cbsz = 0 : i32, k = 16 : i32, m = 16 : i32, n = 16 : i32, blocks = 1 : i32 } blgp = none : vector<4xf16>, vector<4xf32>			amdgpu.mfma %arg4 * %arg4 + %arg3 { abid = 0 : i32, cbsz = 0 : i32, k = 16 : i32, m = 16 : i32, n = 16 : i32, blocks = 1 : i32 } blgp = none : vector<4xf16>, vector<4xf16>, vector<4xf32>
	// CHECK: rocdl.mfma.i32.32x32x4i8{{.*}}: (i32, i32, vector<32xi32>, i32, i32, i32) -> vector<32xi32>			// CHECK: rocdl.mfma.i32.32x32x4i8{{.*}}: (i32, i32, vector<32xi32>, i32, i32, i32) -> vector<32xi32>
	amdgpu.mfma %arg5 * %arg5 + %arg6 { abid = 0 : i32, cbsz = 0 : i32, k = 4 : i32, m = 32 : i32, n = 32 : i32, blocks = 2 : i32 } blgp = none : vector<4xi8>, vector<32xi32>			amdgpu.mfma %arg5 * %arg5 + %arg6 { abid = 0 : i32, cbsz = 0 : i32, k = 4 : i32, m = 32 : i32, n = 32 : i32, blocks = 2 : i32 } blgp = none : vector<4xi8>, vector<4xi8>, vector<32xi32>
	// CHECK: rocdl.mfma.i32.16x16x4i8{{.*}}: (i32, i32, vector<16xi32>, i32, i32, i32) -> vector<16xi32>			// CHECK: rocdl.mfma.i32.16x16x4i8{{.*}}: (i32, i32, vector<16xi32>, i32, i32, i32) -> vector<16xi32>
	amdgpu.mfma %arg5 * %arg5 + %arg7 { abid = 0 : i32, cbsz = 0 : i32, k = 4 : i32, m = 16 : i32, n = 16 : i32, blocks = 4 : i32 } blgp = none : vector<4xi8>, vector<16xi32>			amdgpu.mfma %arg5 * %arg5 + %arg7 { abid = 0 : i32, cbsz = 0 : i32, k = 4 : i32, m = 16 : i32, n = 16 : i32, blocks = 4 : i32 } blgp = none : vector<4xi8>, vector<4xi8>, vector<16xi32>
	// CHECK: rocdl.mfma.i32.4x4x4i8{{.*}}: (i32, i32, vector<4xi32>, i32, i32, i32) -> vector<4xi32>			// CHECK: rocdl.mfma.i32.4x4x4i8{{.*}}: (i32, i32, vector<4xi32>, i32, i32, i32) -> vector<4xi32>
	amdgpu.mfma %arg5 * %arg5 + %arg8 { abid = 0 : i32, cbsz = 0 : i32, k = 4 : i32, m = 4 : i32, n = 4 : i32, blocks = 16 : i32 } blgp = none : vector<4xi8>, vector<4xi32>			amdgpu.mfma %arg5 * %arg5 + %arg8 { abid = 0 : i32, cbsz = 0 : i32, k = 4 : i32, m = 4 : i32, n = 4 : i32, blocks = 16 : i32 } blgp = none : vector<4xi8>, vector<4xi8>, vector<4xi32>
	// CHECK: rocdl.mfma.i32.32x32x8i8{{.*}}: (i32, i32, vector<16xi32>, i32, i32, i32) -> vector<16xi32>			// CHECK: rocdl.mfma.i32.32x32x8i8{{.*}}: (i32, i32, vector<16xi32>, i32, i32, i32) -> vector<16xi32>
	amdgpu.mfma %arg5 * %arg5 + %arg7 { abid = 0 : i32, cbsz = 0 : i32, k = 8 : i32, m = 32 : i32, n = 32 : i32, blocks = 1 : i32 } blgp = none : vector<4xi8>, vector<16xi32>			amdgpu.mfma %arg5 * %arg5 + %arg7 { abid = 0 : i32, cbsz = 0 : i32, k = 8 : i32, m = 32 : i32, n = 32 : i32, blocks = 1 : i32 } blgp = none : vector<4xi8>, vector<4xi8>, vector<16xi32>
	// CHECK: rocdl.mfma.i32.16x16x16i8{{.*}}: (i32, i32, vector<4xi32>, i32, i32, i32) -> vector<4xi32>			// CHECK: rocdl.mfma.i32.16x16x16i8{{.*}}: (i32, i32, vector<4xi32>, i32, i32, i32) -> vector<4xi32>
	amdgpu.mfma %arg5 * %arg5 + %arg8 { abid = 0 : i32, cbsz = 0 : i32, k = 16 : i32, m = 16 : i32, n = 16 : i32, blocks = 1 : i32 } blgp = none : vector<4xi8>, vector<4xi32>			amdgpu.mfma %arg5 * %arg5 + %arg8 { abid = 0 : i32, cbsz = 0 : i32, k = 16 : i32, m = 16 : i32, n = 16 : i32, blocks = 1 : i32 } blgp = none : vector<4xi8>, vector<4xi8>, vector<4xi32>
	// CHECK: rocdl.mfma.f32.32x32x2bf16{{.*}}: (vector<2xbf16>, vector<2xbf16>, vector<32xf32>, i32, i32, i32) -> vector<32xf32>			// CHECK: rocdl.mfma.f32.32x32x2bf16{{.*}}: (vector<2xbf16>, vector<2xbf16>, vector<32xf32>, i32, i32, i32) -> vector<32xf32>
	amdgpu.mfma %arg9 * %arg9 + %arg1 { abid = 0 : i32, cbsz = 0 : i32, k = 2 : i32, m = 32 : i32, n = 32 : i32, blocks = 2 : i32 } blgp = none : vector<2xbf16>, vector<32xf32>			amdgpu.mfma %arg9 * %arg9 + %arg1 { abid = 0 : i32, cbsz = 0 : i32, k = 2 : i32, m = 32 : i32, n = 32 : i32, blocks = 2 : i32 } blgp = none : vector<2xbf16>, vector<2xbf16>, vector<32xf32>
	// CHECK: rocdl.mfma.f32.16x16x2bf16{{.*}}: (vector<2xbf16>, vector<2xbf16>, vector<16xf32>, i32, i32, i32) -> vector<16xf32>			// CHECK: rocdl.mfma.f32.16x16x2bf16{{.*}}: (vector<2xbf16>, vector<2xbf16>, vector<16xf32>, i32, i32, i32) -> vector<16xf32>
	amdgpu.mfma %arg9 * %arg9 + %arg2 { abid = 0 : i32, cbsz = 0 : i32, k = 2 : i32, m = 16 : i32, n = 16 : i32, blocks = 4 : i32 } blgp = none : vector<2xbf16>, vector<16xf32>			amdgpu.mfma %arg9 * %arg9 + %arg2 { abid = 0 : i32, cbsz = 0 : i32, k = 2 : i32, m = 16 : i32, n = 16 : i32, blocks = 4 : i32 } blgp = none : vector<2xbf16>, vector<2xbf16>, vector<16xf32>
	// CHECK: rocdl.mfma.f32.4x4x2bf16{{.*}}: (vector<2xbf16>, vector<2xbf16>, vector<4xf32>, i32, i32, i32) -> vector<4xf32>			// CHECK: rocdl.mfma.f32.4x4x2bf16{{.*}}: (vector<2xbf16>, vector<2xbf16>, vector<4xf32>, i32, i32, i32) -> vector<4xf32>
	amdgpu.mfma %arg9 * %arg9 + %arg3 { abid = 0 : i32, cbsz = 0 : i32, k = 2 : i32, m = 4 : i32, n = 4 : i32, blocks = 16 : i32 } blgp = none : vector<2xbf16>, vector<4xf32>			amdgpu.mfma %arg9 * %arg9 + %arg3 { abid = 0 : i32, cbsz = 0 : i32, k = 2 : i32, m = 4 : i32, n = 4 : i32, blocks = 16 : i32 } blgp = none : vector<2xbf16>, vector<2xbf16>, vector<4xf32>
	// CHECK: rocdl.mfma.f32.32x32x4bf16{{.*}}: (vector<2xbf16>, vector<2xbf16>, vector<16xf32>, i32, i32, i32) -> vector<16xf32>			// CHECK: rocdl.mfma.f32.32x32x4bf16{{.*}}: (vector<2xbf16>, vector<2xbf16>, vector<16xf32>, i32, i32, i32) -> vector<16xf32>
	amdgpu.mfma %arg9 * %arg9 + %arg2 { abid = 0 : i32, cbsz = 0 : i32, k = 4 : i32, m = 32 : i32, n = 32 : i32, blocks = 1 : i32 } blgp = none : vector<2xbf16>, vector<16xf32>			amdgpu.mfma %arg9 * %arg9 + %arg2 { abid = 0 : i32, cbsz = 0 : i32, k = 4 : i32, m = 32 : i32, n = 32 : i32, blocks = 1 : i32 } blgp = none : vector<2xbf16>, vector<2xbf16>, vector<16xf32>
	// CHECK: rocdl.mfma.f32.16x16x8bf16{{.*}}: (vector<2xbf16>, vector<2xbf16>, vector<4xf32>, i32, i32, i32) -> vector<4xf32>			// CHECK: rocdl.mfma.f32.16x16x8bf16{{.*}}: (vector<2xbf16>, vector<2xbf16>, vector<4xf32>, i32, i32, i32) -> vector<4xf32>
	amdgpu.mfma %arg9 * %arg9 + %arg3 { abid = 0 : i32, cbsz = 0 : i32, k = 8 : i32, m = 16 : i32, n = 16 : i32, blocks = 1 : i32 } blgp = none : vector<2xbf16>, vector<4xf32>			amdgpu.mfma %arg9 * %arg9 + %arg3 { abid = 0 : i32, cbsz = 0 : i32, k = 8 : i32, m = 16 : i32, n = 16 : i32, blocks = 1 : i32 } blgp = none : vector<2xbf16>, vector<2xbf16>, vector<4xf32>
	// CHECK: rocdl.mfma.f32.32x32x4bf16.1k{{.*}}: (vector<4xbf16>, vector<4xbf16>, vector<32xf32>, i32, i32, i32) -> vector<32xf32>			// CHECK: rocdl.mfma.f32.32x32x4bf16.1k{{.*}}: (vector<4xbf16>, vector<4xbf16>, vector<32xf32>, i32, i32, i32) -> vector<32xf32>
	amdgpu.mfma %arg10 * %arg10 + %arg1 { abid = 0 : i32, cbsz = 0 : i32, k = 4 : i32, m = 32 : i32, n = 32 : i32, blocks = 2 : i32 } blgp = none : vector<4xbf16>, vector<32xf32>			amdgpu.mfma %arg10 * %arg10 + %arg1 { abid = 0 : i32, cbsz = 0 : i32, k = 4 : i32, m = 32 : i32, n = 32 : i32, blocks = 2 : i32 } blgp = none : vector<4xbf16>, vector<4xbf16>, vector<32xf32>
	// CHECK: rocdl.mfma.f32.16x16x4bf16.1k{{.*}}: (vector<4xbf16>, vector<4xbf16>, vector<16xf32>, i32, i32, i32) -> vector<16xf32>			// CHECK: rocdl.mfma.f32.16x16x4bf16.1k{{.*}}: (vector<4xbf16>, vector<4xbf16>, vector<16xf32>, i32, i32, i32) -> vector<16xf32>
	amdgpu.mfma %arg10 * %arg10 + %arg2 { abid = 0 : i32, cbsz = 0 : i32, k = 4 : i32, m = 16 : i32, n = 16 : i32, blocks = 4 : i32 } blgp = none : vector<4xbf16>, vector<16xf32>			amdgpu.mfma %arg10 * %arg10 + %arg2 { abid = 0 : i32, cbsz = 0 : i32, k = 4 : i32, m = 16 : i32, n = 16 : i32, blocks = 4 : i32 } blgp = none : vector<4xbf16>, vector<4xbf16>, vector<16xf32>
	// CHECK: rocdl.mfma.f32.4x4x4bf16.1k{{.*}}: (vector<4xbf16>, vector<4xbf16>, vector<4xf32>, i32, i32, i32) -> vector<4xf32>			// CHECK: rocdl.mfma.f32.4x4x4bf16.1k{{.*}}: (vector<4xbf16>, vector<4xbf16>, vector<4xf32>, i32, i32, i32) -> vector<4xf32>
	amdgpu.mfma %arg10 * %arg10 + %arg3 { abid = 0 : i32, cbsz = 0 : i32, k = 4 : i32, m = 4 : i32, n = 4 : i32, blocks = 16 : i32 } blgp = none : vector<4xbf16>, vector<4xf32>			amdgpu.mfma %arg10 * %arg10 + %arg3 { abid = 0 : i32, cbsz = 0 : i32, k = 4 : i32, m = 4 : i32, n = 4 : i32, blocks = 16 : i32 } blgp = none : vector<4xbf16>, vector<4xbf16>, vector<4xf32>
	// CHECK: rocdl.mfma.f32.32x32x8bf16.1k{{.*}}: (vector<4xbf16>, vector<4xbf16>, vector<16xf32>, i32, i32, i32) -> vector<16xf32>			// CHECK: rocdl.mfma.f32.32x32x8bf16.1k{{.*}}: (vector<4xbf16>, vector<4xbf16>, vector<16xf32>, i32, i32, i32) -> vector<16xf32>
	amdgpu.mfma %arg10 * %arg10 + %arg2 { abid = 0 : i32, cbsz = 0 : i32, k = 8 : i32, m = 32 : i32, n = 32 : i32, blocks = 1 : i32 } blgp = none : vector<4xbf16>, vector<16xf32>			amdgpu.mfma %arg10 * %arg10 + %arg2 { abid = 0 : i32, cbsz = 0 : i32, k = 8 : i32, m = 32 : i32, n = 32 : i32, blocks = 1 : i32 } blgp = none : vector<4xbf16>, vector<4xbf16>, vector<16xf32>
	// CHECK: rocdl.mfma.f32.16x16x16bf16.1k{{.*}}: (vector<4xbf16>, vector<4xbf16>, vector<4xf32>, i32, i32, i32) -> vector<4xf32>			// CHECK: rocdl.mfma.f32.16x16x16bf16.1k{{.*}}: (vector<4xbf16>, vector<4xbf16>, vector<4xf32>, i32, i32, i32) -> vector<4xf32>
	amdgpu.mfma %arg10 * %arg10 + %arg3 { abid = 0 : i32, cbsz = 0 : i32, k = 16 : i32, m = 16 : i32, n = 16 : i32, blocks = 1 : i32 } blgp = none : vector<4xbf16>, vector<4xf32>			amdgpu.mfma %arg10 * %arg10 + %arg3 { abid = 0 : i32, cbsz = 0 : i32, k = 16 : i32, m = 16 : i32, n = 16 : i32, blocks = 1 : i32 } blgp = none : vector<4xbf16>, vector<4xbf16>, vector<4xf32>
	// CHECK: rocdl.mfma.f64.16x16x4f64{{.*}}: (f64, f64, vector<4xf64>, i32, i32, i32) -> vector<4xf64>			// CHECK: rocdl.mfma.f64.16x16x4f64{{.*}}: (f64, f64, vector<4xf64>, i32, i32, i32) -> vector<4xf64>
	amdgpu.mfma %arg11 * %arg11 + %arg12 { abid = 0 : i32, cbsz = 0 : i32, k = 4 : i32, m = 16 : i32, n = 16 : i32, blocks = 1 : i32 } blgp = none : f64, vector<4xf64>			amdgpu.mfma %arg11 * %arg11 + %arg12 { abid = 0 : i32, cbsz = 0 : i32, k = 4 : i32, m = 16 : i32, n = 16 : i32, blocks = 1 : i32 } blgp = none : f64, f64, vector<4xf64>
	// CHECK: rocdl.mfma.f64.4x4x4f64{{.*}}: (f64, f64, f64, i32, i32, i32) -> f64			// CHECK: rocdl.mfma.f64.4x4x4f64{{.*}}: (f64, f64, f64, i32, i32, i32) -> f64
	amdgpu.mfma %arg11 * %arg11 + %arg11 { abid = 0 : i32, cbsz = 0 : i32, k = 4 : i32, m = 4 : i32, n = 4 : i32, blocks = 4 : i32 } blgp = none : f64, f64			amdgpu.mfma %arg11 * %arg11 + %arg11 { abid = 0 : i32, cbsz = 0 : i32, k = 4 : i32, m = 4 : i32, n = 4 : i32, blocks = 4 : i32 } blgp = none : f64, f64, f64
	// CHECK: rocdl.mfma.i32.16x16x32.i8{{.*}}: (i64, i64, vector<4xi32>, i32, i32, i32) -> vector<4xi32>			// CHECK: rocdl.mfma.i32.16x16x32.i8{{.*}}: (i64, i64, vector<4xi32>, i32, i32, i32) -> vector<4xi32>
	amdgpu.mfma %arg13 * %arg13 + %arg8 { abid = 0 : i32, cbsz = 0 : i32, k = 32 : i32, m = 16 : i32, n = 16 : i32, blocks = 1 : i32 } blgp = none : vector<8xi8>, vector<4xi32>			amdgpu.mfma %arg13 * %arg13 + %arg8 { abid = 0 : i32, cbsz = 0 : i32, k = 32 : i32, m = 16 : i32, n = 16 : i32, blocks = 1 : i32 } blgp = none : vector<8xi8>, vector<8xi8>, vector<4xi32>
	// CHECK: rocdl.mfma.i32.32x32x16.i8{{.*}}: (i64, i64, vector<16xi32>, i32, i32, i32) -> vector<16xi32>			// CHECK: rocdl.mfma.i32.32x32x16.i8{{.*}}: (i64, i64, vector<16xi32>, i32, i32, i32) -> vector<16xi32>
	amdgpu.mfma %arg13 * %arg13 + %arg7 { abid = 0 : i32, cbsz = 0 : i32, k = 16 : i32, m = 32 : i32, n = 32 : i32, blocks = 1 : i32 } blgp = none : vector<8xi8>, vector<16xi32>			amdgpu.mfma %arg13 * %arg13 + %arg7 { abid = 0 : i32, cbsz = 0 : i32, k = 16 : i32, m = 32 : i32, n = 32 : i32, blocks = 1 : i32 } blgp = none : vector<8xi8>, vector<8xi8>, vector<16xi32>
	// CHECK: rocdl.mfma.f32.16x16x8.xf32{{.*}}: (vector<2xf32>, vector<2xf32>, vector<4xf32>, i32, i32, i32) -> vector<4xf32>			// CHECK: rocdl.mfma.f32.16x16x8.xf32{{.*}}: (vector<2xf32>, vector<2xf32>, vector<4xf32>, i32, i32, i32) -> vector<4xf32>
	amdgpu.mfma %arg14 * %arg14 + %arg3 { abid = 0 : i32, cbsz = 0 : i32, k = 8 : i32, m = 16 : i32, n = 16 : i32, blocks = 1 : i32, reducePrecision } blgp = none : vector<2xf32>, vector<4xf32>			amdgpu.mfma %arg14 * %arg14 + %arg3 { abid = 0 : i32, cbsz = 0 : i32, k = 8 : i32, m = 16 : i32, n = 16 : i32, blocks = 1 : i32, reducePrecision } blgp = none : vector<2xf32>, vector<2xf32>, vector<4xf32>
	// CHECK: rocdl.mfma.f32.32x32x4.xf32{{.*}}: (vector<2xf32>, vector<2xf32>, vector<16xf32>, i32, i32, i32) -> vector<16xf32>			// CHECK: rocdl.mfma.f32.32x32x4.xf32{{.*}}: (vector<2xf32>, vector<2xf32>, vector<16xf32>, i32, i32, i32) -> vector<16xf32>
	amdgpu.mfma %arg14 * %arg14 + %arg2 { abid = 0 : i32, cbsz = 0 : i32, k = 4 : i32, m = 32 : i32, n = 32 : i32, blocks = 1 : i32, reducePrecision } blgp = none : vector<2xf32>, vector<16xf32>			amdgpu.mfma %arg14 * %arg14 + %arg2 { abid = 0 : i32, cbsz = 0 : i32, k = 4 : i32, m = 32 : i32, n = 32 : i32, blocks = 1 : i32, reducePrecision } blgp = none : vector<2xf32>, vector<2xf32>, vector<16xf32>
				// CHECK: rocdl.mfma.f32.16x16x32.bf8.bf8{{.*}}: (i64, i64, vector<4xf32>, i32, i32, i32) -> vector<4xf32>
				amdgpu.mfma %arg15 * %arg15 + %arg3 { abid = 0 : i32, cbsz = 0 : i32, k = 32 : i32, m = 16 : i32, n = 16 : i32, blocks = 1 : i32 } blgp = none : vector<8xf8E5M2FNUZ>, vector<8xf8E5M2FNUZ>, vector<4xf32>
				// CHECK: rocdl.mfma.f32.16x16x32.bf8.fp8{{.*}}: (i64, i64, vector<4xf32>, i32, i32, i32) -> vector<4xf32>
				amdgpu.mfma %arg15 * %arg16 + %arg3 { abid = 0 : i32, cbsz = 0 : i32, k = 32 : i32, m = 16 : i32, n = 16 : i32, blocks = 1 : i32 } blgp = none : vector<8xf8E5M2FNUZ>, vector<8xf8E4M3FNUZ>, vector<4xf32>
				// CHECK: rocdl.mfma.f32.16x16x32.fp8.bf8{{.*}}: (i64, i64, vector<4xf32>, i32, i32, i32) -> vector<4xf32>
				amdgpu.mfma %arg16 * %arg15 + %arg3 { abid = 0 : i32, cbsz = 0 : i32, k = 32 : i32, m = 16 : i32, n = 16 : i32, blocks = 1 : i32 } blgp = none : vector<8xf8E4M3FNUZ>, vector<8xf8E5M2FNUZ>, vector<4xf32>
				// CHECK: rocdl.mfma.f32.16x16x32.fp8.fp8{{.*}}: (i64, i64, vector<4xf32>, i32, i32, i32) -> vector<4xf32>
				amdgpu.mfma %arg16 * %arg16 + %arg3 { abid = 0 : i32, cbsz = 0 : i32, k = 32 : i32, m = 16 : i32, n = 16 : i32, blocks = 1 : i32 } blgp = none : vector<8xf8E4M3FNUZ>, vector<8xf8E4M3FNUZ>, vector<4xf32>
				// CHECK: rocdl.mfma.f32.32x32x16.bf8.bf8{{.*}}: (i64, i64, vector<16xf32>, i32, i32, i32) -> vector<16xf32>
				amdgpu.mfma %arg15 * %arg15 + %arg2 { abid = 0 : i32, cbsz = 0 : i32, k = 16 : i32, m = 32 : i32, n = 32 : i32, blocks = 1 : i32 } blgp = none : vector<8xf8E5M2FNUZ>, vector<8xf8E5M2FNUZ>, vector<16xf32>
				// CHECK: rocdl.mfma.f32.32x32x16.bf8.fp8{{.*}}: (i64, i64, vector<16xf32>, i32, i32, i32) -> vector<16xf32>
				amdgpu.mfma %arg15 * %arg16 + %arg2 { abid = 0 : i32, cbsz = 0 : i32, k = 16 : i32, m = 32 : i32, n = 32 : i32, blocks = 1 : i32 } blgp = none : vector<8xf8E5M2FNUZ>, vector<8xf8E4M3FNUZ>, vector<16xf32>
				// CHECK: rocdl.mfma.f32.32x32x16.fp8.bf8{{.*}}: (i64, i64, vector<16xf32>, i32, i32, i32) -> vector<16xf32>
				amdgpu.mfma %arg16 * %arg15 + %arg2 { abid = 0 : i32, cbsz = 0 : i32, k = 16 : i32, m = 32 : i32, n = 32 : i32, blocks = 1 : i32 } blgp = none : vector<8xf8E4M3FNUZ>, vector<8xf8E5M2FNUZ>, vector<16xf32>
				// CHECK: rocdl.mfma.f32.32x32x16.fp8.fp8{{.*}}: (i64, i64, vector<16xf32>, i32, i32, i32) -> vector<16xf32>
				amdgpu.mfma %arg16 * %arg16 + %arg2 { abid = 0 : i32, cbsz = 0 : i32, k = 16 : i32, m = 32 : i32, n = 32 : i32, blocks = 1 : i32 } blgp = none : vector<8xf8E4M3FNUZ>, vector<8xf8E4M3FNUZ>, vector<16xf32>

	func.return			func.return
	}			}
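
The eight f8 cases above follow a regular naming pattern: per the summary, f8E5M2FNUZ is bf8 and f8E4M3FNUZ is fp8, and the two suffixes appear in operand order in the ROCDL op name. Below is a minimal C++ sketch of that pattern only. The names `F8Kind`, `f8Suffix`, and `f8MfmaName` are hypothetical and are not taken from AMDGPUToROCDL.cpp; the actual lowering may select intrinsics differently.

```cpp
#include <string>

// Hypothetical stand-ins for the two f8 element types used in the tests.
enum class F8Kind { E5M2FNUZ, E4M3FNUZ };

// bf8 is f8E5M2FNUZ and fp8 is f8E4M3FNUZ, following the revision summary.
inline std::string f8Suffix(F8Kind kind) {
  return kind == F8Kind::E5M2FNUZ ? "bf8" : "fp8";
}

// Builds the op name that the CHECK lines expect for a given tile shape
// and pair of source element types (source A first, then source B).
inline std::string f8MfmaName(int m, int n, int k, F8Kind a, F8Kind b) {
  return "rocdl.mfma.f32." + std::to_string(m) + "x" + std::to_string(n) +
         "x" + std::to_string(k) + "." + f8Suffix(a) + "." + f8Suffix(b);
}
```

For example, `f8MfmaName(16, 16, 32, F8Kind::E5M2FNUZ, F8Kind::E4M3FNUZ)` yields `rocdl.mfma.f32.16x16x32.bf8.fp8`, which is the name in the second f8 CHECK line.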

mlir/test/Dialect/AMDGPU/invalid.mlir

	// RUN: mlir-opt %s -split-input-file -verify-diagnostics			// RUN: mlir-opt %s -split-input-file -verify-diagnostics

	// -----			// -----

				func.func @bad_source_types(%a: vector<2xf32>, %b: vector<4xf16>,
				%c: vector<32xf32>) -> vector<32xf32> {
				// expected-error@+1 {{'amdgpu.mfma' op expected both non-f8 source operand types to match exactly}}
				%d = amdgpu.mfma %a * %b + %c {
				m = 32 : i32, n = 32 : i32, k = 1 : i32, blocks = 2 : i32,
				abid = 0 : i32, cbsz = 0 : i32} blgp = none : vector<2xf32>, vector<4xf16>, vector<32xf32>
				func.return %d : vector<32xf32>
				}

				// -----

				func.func @bad_source_types_f8(%a: vector<8xf8E5M2FNUZ>, %b: vector<8xi8>,
				%c: vector<32xf32>) -> vector<32xf32> {
				// expected-error@+1 {{'amdgpu.mfma' op expected both source operands to have f8 elements}}
				%d = amdgpu.mfma %a * %b + %c {
				m = 32 : i32, n = 32 : i32, k = 1 : i32, blocks = 2 : i32,
				abid = 0 : i32, cbsz = 0 : i32} blgp = none : vector<8xf8E5M2FNUZ>, vector<8xi8>, vector<32xf32>
				func.return %d : vector<32xf32>
				}
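
The two new negative tests above pin down the relaxed source-type rule: non-f8 operands must still have identical types, while an f8 operand may be paired with either f8 format but not with a non-f8 type. A minimal C++ sketch of that rule follows, assuming the check is keyed on source A. `OperandType`, `isF8`, and `checkSources` are hypothetical names used for illustration; this is not the verifier code from AMDGPUDialect.cpp, and it covers only the two diagnostics shown in these tests.

```cpp
#include <string>

// Hypothetical, simplified operand description: element type name plus
// lane count (1 for a scalar operand).
struct OperandType {
  std::string elem;
  int lanes;
  bool operator==(const OperandType &other) const {
    return elem == other.elem && lanes == other.lanes;
  }
};

inline bool isF8(const OperandType &t) {
  return t.elem == "f8E5M2FNUZ" || t.elem == "f8E4M3FNUZ";
}

// Returns an empty string when the pair is accepted, otherwise the
// diagnostic text that the tests above expect.
inline std::string checkSources(const OperandType &a, const OperandType &b) {
  if (isF8(a)) {
    // Mixed f8 formats are allowed, but both sides must be f8.
    if (!isF8(b))
      return "expected both source operands to have f8 elements";
    return "";
  }
  if (!(a == b))
    return "expected both non-f8 source operand types to match exactly";
  return "";
}
```

With these definitions, `checkSources({"f8E5M2FNUZ", 8}, {"f8E4M3FNUZ", 8})` returns an empty string, while the operand pairs from `@bad_source_types` and `@bad_source_types_f8` return their respective messages.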

				// -----

	func.func @bad_source_arguments(%a: vector<2xf32>, %b: vector<2xf32>,			func.func @bad_source_arguments(%a: vector<2xf32>, %b: vector<2xf32>,
	%c: vector<32xf32>) -> vector<32xf32> {			%c: vector<32xf32>) -> vector<32xf32> {
	// expected-error@+1 {{'amdgpu.mfma' op expected 1 source values for this operation but got 2}}			// expected-error@+1 {{'amdgpu.mfma' op expected 1 source values for this operation but got 2}}
	%d = amdgpu.mfma %a * %b + %c {			%d = amdgpu.mfma %a * %b + %c {
	m = 32 : i32, n = 32 : i32, k = 1 : i32, blocks = 2 : i32,			m = 32 : i32, n = 32 : i32, k = 1 : i32, blocks = 2 : i32,
	abid = 0 : i32, cbsz = 0 : i32} blgp = none : vector<2xf32>, vector<32xf32>			abid = 0 : i32, cbsz = 0 : i32} blgp = none : vector<2xf32>, vector<2xf32>, vector<32xf32>
	func.return %d : vector<32xf32>			func.return %d : vector<32xf32>
	}			}

	// -----			// -----

	func.func @bad_source_arguments_i8(%a: vector<8xi8>, %b: vector<8xi8>,			func.func @bad_source_arguments_i8(%a: vector<8xi8>, %b: vector<8xi8>,
	%c: vector<4xi32>) -> vector<4xi32> {			%c: vector<4xi32>) -> vector<4xi32> {
	// expected-error@+1 {{'amdgpu.mfma' op expected 4 source values for this operation but got 8}}			// expected-error@+1 {{'amdgpu.mfma' op expected 4 source values for this operation but got 8}}
	%d = amdgpu.mfma %a * %b + %c {			%d = amdgpu.mfma %a * %b + %c {
	m = 32 : i32, n = 32 : i32, k = 4 : i32, blocks = 2 : i32,			m = 32 : i32, n = 32 : i32, k = 4 : i32, blocks = 2 : i32,
	abid = 0 : i32, cbsz = 0 : i32} blgp = none : vector<8xi8>, vector<4xi32>			abid = 0 : i32, cbsz = 0 : i32} blgp = none : vector<8xi8>, vector<8xi8>, vector<4xi32>
	func.return %d : vector<4xi32>			func.return %d : vector<4xi32>
	}			}

	// -----			// -----

	func.func @bad_dest_type(%a: f32, %b: f32, %c: vector<16xf32>) -> vector<16xf32> {			func.func @bad_dest_type(%a: f32, %b: f32, %c: vector<16xf32>) -> vector<16xf32> {
	// expected-error@+1 {{'amdgpu.mfma' op expected 32 result values for this operation but got 16}}			// expected-error@+1 {{'amdgpu.mfma' op expected 32 result values for this operation but got 16}}
	%d = amdgpu.mfma %a * %b + %c {			%d = amdgpu.mfma %a * %b + %c {
	m = 32 : i32, n = 32 : i32, k = 1 : i32, blocks = 2 : i32,			m = 32 : i32, n = 32 : i32, k = 1 : i32, blocks = 2 : i32,
	abid = 0 : i32, cbsz = 0 : i32} blgp = none : f32, vector<16xf32>			abid = 0 : i32, cbsz = 0 : i32} blgp = none : f32, f32, vector<16xf32>
	return %d : vector<16xf32>			return %d : vector<16xf32>
	}			}
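
The element counts in the three diagnostics above (1 source value, 4 source values, 32 result values) are consistent with dividing the tile volume by a 64-lane wavefront. The C++ sketch below encodes that arithmetic. The formula and the wavefront size of 64 are inferred from the numbers in these tests and the valid cases in mfma.mlir, not quoted from the verifier, and the function names are hypothetical.

```cpp
#include <cstdint>

// Assumption: each operand is spread across a 64-lane wavefront.
constexpr int64_t kWaveSize = 64;

// Number of source elements each lane supplies for A (and for B).
constexpr int64_t expectedSourceValues(int64_t m, int64_t k, int64_t blocks) {
  return (m * k * blocks) / kWaveSize;
}

// Number of accumulator/result elements each lane holds.
constexpr int64_t expectedResultValues(int64_t m, int64_t n, int64_t blocks) {
  return (m * n * blocks) / kWaveSize;
}

// The three negative tests above.
static_assert(expectedSourceValues(32, 1, 2) == 1, "f32 32x32x1, 2 blocks");
static_assert(expectedSourceValues(32, 4, 2) == 4, "i8 32x32x4, 2 blocks");
static_assert(expectedResultValues(32, 32, 2) == 32, "f32 32x32x1, 2 blocks");
// One of the new f8 cases: vector<8xf8...> sources, vector<4xf32> result.
static_assert(expectedSourceValues(16, 32, 1) == 8, "f8 16x16x32");
static_assert(expectedResultValues(16, 16, 1) == 4, "f8 16x16x32");
```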

	// -----			// -----

	func.func @f64_permuting_b(%a: f64, %b: f64, %c: vector<4xf64>) -> vector<4xf64> {			func.func @f64_permuting_b(%a: f64, %b: f64, %c: vector<4xf64>) -> vector<4xf64> {
	// expected-error@+1 {{'amdgpu.mfma' op double-precision ops do not support permuting lanes of B}}			// expected-error@+1 {{'amdgpu.mfma' op double-precision ops do not support permuting lanes of B}}
	%d = amdgpu.mfma %a * %b + %c {			%d = amdgpu.mfma %a * %b + %c {
	m = 16 : i32, n = 16 : i32, k = 4 : i32, blocks = 1 : i32,			m = 16 : i32, n = 16 : i32, k = 4 : i32, blocks = 1 : i32,
	abid = 0 : i32, cbsz = 0 : i32} blgp = bcast_first_32 : f64, vector<4xf64>			abid = 0 : i32, cbsz = 0 : i32} blgp = bcast_first_32 : f64, f64, vector<4xf64>
	return %d : vector<4xf64>			return %d : vector<4xf64>
	}			}

	// -----			// -----

	func.func @f64_permuting_a(%a: f64, %b: f64, %c: vector<4xf64>) -> vector<4xf64> {			func.func @f64_permuting_a(%a: f64, %b: f64, %c: vector<4xf64>) -> vector<4xf64> {
	// expected-error@+1 {{'amdgpu.mfma' op double-precision ops do not support permuting lanes of A}}			// expected-error@+1 {{'amdgpu.mfma' op double-precision ops do not support permuting lanes of A}}
	%d = amdgpu.mfma %a * %b + %c {			%d = amdgpu.mfma %a * %b + %c {
	m = 16 : i32, n = 16 : i32, k = 4 : i32, blocks = 1 : i32,			m = 16 : i32, n = 16 : i32, k = 4 : i32, blocks = 1 : i32,
	abid = 0 : i32, cbsz = 1 : i32} blgp = none : f64, vector<4xf64>			abid = 0 : i32, cbsz = 1 : i32} blgp = none : f64, f64, vector<4xf64>
	return %d : vector<4xf64>			return %d : vector<4xf64>
	}			}

	// -----			// -----

	func.func @abid_without_bradcast(%a: f32, %b: f32, %c: vector<32xf32>) -> vector<32xf32> {			func.func @abid_without_bradcast(%a: f32, %b: f32, %c: vector<32xf32>) -> vector<32xf32> {
	// expected-error@+1 {{'amdgpu.mfma' op block ID for permuting A (abid) must be below 2 ** cbsz}}			// expected-error@+1 {{'amdgpu.mfma' op block ID for permuting A (abid) must be below 2 ** cbsz}}
	%d = amdgpu.mfma %a * %b + %c {			%d = amdgpu.mfma %a * %b + %c {
	m = 32 : i32, n = 32 : i32, k = 1 : i32, blocks = 2 : i32,			m = 32 : i32, n = 32 : i32, k = 1 : i32, blocks = 2 : i32,
	abid = 1 : i32, cbsz = 0 : i32} blgp = none : f32, vector<32xf32>			abid = 1 : i32, cbsz = 0 : i32} blgp = none : f32, f32, vector<32xf32>
	func.return %d : vector<32xf32>			func.return %d : vector<32xf32>
	}			}

	// -----			// -----

	func.func @abid_too_large(%a: f32, %b: f32, %c: vector<32xf32>) -> vector<32xf32> {			func.func @abid_too_large(%a: f32, %b: f32, %c: vector<32xf32>) -> vector<32xf32> {
	// expected-error@+1 {{'amdgpu.mfma' op block ID for permuting A (abid) must be below 2 ** cbsz}}			// expected-error@+1 {{'amdgpu.mfma' op block ID for permuting A (abid) must be below 2 ** cbsz}}
	%d = amdgpu.mfma %a * %b + %c {			%d = amdgpu.mfma %a * %b + %c {
	m = 32 : i32, n = 32 : i32, k = 1 : i32, blocks = 2 : i32,			m = 32 : i32, n = 32 : i32, k = 1 : i32, blocks = 2 : i32,
	abid = 2 : i32, cbsz = 1 : i32} blgp = none : f32, vector<32xf32>			abid = 2 : i32, cbsz = 1 : i32} blgp = none : f32, f32, vector<32xf32>
	func.return %d : vector<32xf32>			func.return %d : vector<32xf32>
	}			}
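
Both abid tests above exercise the same bound stated in the diagnostic: abid must be below 2 ** cbsz. A one-line C++ sketch of that predicate is given below; `abidInRange` is a hypothetical name, and the sketch assumes cbsz is small enough (below 32) for the shift to be well defined.

```cpp
#include <cstdint>

// True when the block ID for permuting A is valid for the given cbsz,
// i.e. abid < 2 ** cbsz. Assumes cbsz < 32.
constexpr bool abidInRange(uint32_t abid, uint32_t cbsz) {
  return abid < (uint32_t{1} << cbsz);
}

static_assert(!abidInRange(1, 0), "@abid_without_bradcast is rejected");
static_assert(!abidInRange(2, 1), "@abid_too_large is rejected");
static_assert(abidInRange(1, 1), "abid = 1, cbsz = 1 is accepted");
```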

	// -----			// -----

	func.func @no_negation(%a: f32, %b: f32, %c: vector<32xf32>) -> vector<32xf32> {			func.func @no_negation(%a: f32, %b: f32, %c: vector<32xf32>) -> vector<32xf32> {
	// expected-error@+1 {{'amdgpu.mfma' op negation flags only available for double-precision operations}}			// expected-error@+1 {{'amdgpu.mfma' op negation flags only available for double-precision operations}}
	%d = amdgpu.mfma %a * %b + %c {			%d = amdgpu.mfma %a * %b + %c {
	m = 32 : i32, n = 32 : i32, k = 1 : i32, blocks = 2 : i32,			m = 32 : i32, n = 32 : i32, k = 1 : i32, blocks = 2 : i32,
	abid = 0 : i32, cbsz = 0 : i32, negateA} blgp = none : f32, vector<32xf32>			abid = 0 : i32, cbsz = 0 : i32, negateA} blgp = none : f32, f32, vector<32xf32>
	func.return %d : vector<32xf32>			func.return %d : vector<32xf32>
	}			}

mlir/test/Dialect/AMDGPU/ops.mlir

...
func.func @lds_barrier() {		func.func @lds_barrier() {
// CHECK: amdgpu.lds_barrier		// CHECK: amdgpu.lds_barrier
amdgpu.lds_barrier		amdgpu.lds_barrier
func.return		func.return
}		}

// CHECK-LABEL: func @mfma		// CHECK-LABEL: func @mfma
func.func @mfma(%arg0 : f32, %arg1 : vector<32xf32>) -> vector<32xf32> {		func.func @mfma(%arg0 : f32, %arg1 : vector<32xf32>) -> vector<32xf32> {
// CHECK: amdgpu.mfma		// CHECK: amdgpu.mfma
%0 = amdgpu.mfma %arg0 * %arg0 + %arg1 { abid = 1 : i32, cbsz = 1 : i32, k = 1 : i32, m = 32 : i32, n = 32 : i32, blocks = 2 : i32 } blgp = bcast_second_32 : f32, vector<32xf32>		%0 = amdgpu.mfma %arg0 * %arg0 + %arg1 { abid = 1 : i32, cbsz = 1 : i32, k = 1 : i32, m = 32 : i32, n = 32 : i32, blocks = 2 : i32 } blgp = bcast_second_32 : f32, f32, vector<32xf32>
func.return %0 : vector<32xf32>		func.return %0 : vector<32xf32>
}		}

mlir/test/Target/LLVMIR/rocdl.mlir

...
llvm.func @rocdl.barrier() {		llvm.func @rocdl.barrier() {
llvm.return		llvm.return
}		}

llvm.func @rocdl.xdlops(%arg0 : f32, %arg1 : f32,		llvm.func @rocdl.xdlops(%arg0 : f32, %arg1 : f32,
%arg2 : vector<32 x f32>, %arg3: i32,		%arg2 : vector<32 x f32>, %arg3: i32,
%arg4 : vector<16 x f32>, %arg5 : vector<4xf32>,		%arg4 : vector<16 x f32>, %arg5 : vector<4xf32>,
%arg6 : vector<4xf16>, %arg7 : vector<32 x i32>,		%arg6 : vector<4xf16>, %arg7 : vector<32 x i32>,
%arg8 : vector<16 x i32>, %arg9 : vector<4xi32>,		%arg8 : vector<16 x i32>, %arg9 : vector<4xi32>,
%arg10 : vector<2xi16>) -> vector<32 x f32> {		%arg10 : vector<2xi16>, %arg11 : i64) -> vector<32 x f32> {
%csti32 = llvm.mlir.constant(42 : i32) : i32		%csti32 = llvm.mlir.constant(42 : i32) : i32

// CHECK-LABEL: rocdl.xdlops		// CHECK-LABEL: rocdl.xdlops
// CHECK: call <32 x float> @llvm.amdgcn.mfma.f32.32x32x1f32(float %{{.*}}, float %{{.*}}, <32 x float> %{{.*}}, i32 {{.*}}, i32 {{.*}}, i32 {{.*}})		// CHECK: call <32 x float> @llvm.amdgcn.mfma.f32.32x32x1f32(float %{{.*}}, float %{{.*}}, <32 x float> %{{.*}}, i32 {{.*}}, i32 {{.*}}, i32 {{.*}})
%r0 = rocdl.mfma.f32.32x32x1f32 %arg0, %arg1, %arg2, %csti32, %csti32, %csti32 :		%r0 = rocdl.mfma.f32.32x32x1f32 %arg0, %arg1, %arg2, %csti32, %csti32, %csti32 :
(f32, f32, vector<32 x f32>,		(f32, f32, vector<32 x f32>,
i32, i32, i32) -> vector<32 x f32>		i32, i32, i32) -> vector<32 x f32>

...
%r18 = rocdl.mfma.f32.32x32x4bf16 %arg10, %arg10, %arg4, %csti32, %csti32, %csti32 :		%r18 = rocdl.mfma.f32.32x32x4bf16 %arg10, %arg10, %arg4, %csti32, %csti32, %csti32 :
(vector<2xi16>, vector<2xi16>, vector<16 x f32>,		(vector<2xi16>, vector<2xi16>, vector<16 x f32>,
i32, i32, i32) -> vector<16 x f32>		i32, i32, i32) -> vector<16 x f32>

// CHECK: call <4 x float> @llvm.amdgcn.mfma.f32.16x16x8bf16(<2 x i16> %{{.*}}, <2 x i16> %{{.*}}, <4 x float> %{{.*}}, i32 {{.*}}, i32 {{.*}}, i32 {{.*}})		// CHECK: call <4 x float> @llvm.amdgcn.mfma.f32.16x16x8bf16(<2 x i16> %{{.*}}, <2 x i16> %{{.*}}, <4 x float> %{{.*}}, i32 {{.*}}, i32 {{.*}}, i32 {{.*}})
%r19 = rocdl.mfma.f32.16x16x8bf16 %arg10, %arg10, %arg5, %csti32, %csti32, %csti32 :		%r19 = rocdl.mfma.f32.16x16x8bf16 %arg10, %arg10, %arg5, %csti32, %csti32, %csti32 :
(vector<2xi16>, vector<2xi16>, vector<4xf32>,		(vector<2xi16>, vector<2xi16>, vector<4xf32>,
i32, i32, i32) -> vector<4xf32>		i32, i32, i32) -> vector<4xf32>

		// CHECK: call <4 x float> @llvm.amdgcn.mfma.f32.16x16x32.bf8.bf8(i64 %{{.*}}, i64 %{{.*}}, <4 x float> %{{.*}}, i32 {{.*}}, i32 {{.*}}, i32 {{.*}})
		%r20 = rocdl.mfma.f32.16x16x32.bf8.bf8 %arg11, %arg11, %arg5, %csti32, %csti32, %csti32 :
		(i64, i64, vector<4xf32>,
		i32, i32, i32) -> vector<4xf32>

		// CHECK: call <4 x float> @llvm.amdgcn.mfma.f32.16x16x32.bf8.fp8(i64 %{{.*}}, i64 %{{.*}}, <4 x float> %{{.*}}, i32 {{.*}}, i32 {{.*}}, i32 {{.*}})
		%r21 = rocdl.mfma.f32.16x16x32.bf8.fp8 %arg11, %arg11, %arg5, %csti32, %csti32, %csti32 :
		(i64, i64, vector<4xf32>,
		i32, i32, i32) -> vector<4xf32>

		// CHECK: call <4 x float> @llvm.amdgcn.mfma.f32.16x16x32.fp8.bf8(i64 %{{.*}}, i64 %{{.*}}, <4 x float> %{{.*}}, i32 {{.*}}, i32 {{.*}}, i32 {{.*}})
		%r22 = rocdl.mfma.f32.16x16x32.fp8.bf8 %arg11, %arg11, %arg5, %csti32, %csti32, %csti32 :
		(i64, i64, vector<4xf32>,
		i32, i32, i32) -> vector<4xf32>

		// CHECK: call <4 x float> @llvm.amdgcn.mfma.f32.16x16x32.fp8.fp8(i64 %{{.*}}, i64 %{{.*}}, <4 x float> %{{.*}}, i32 {{.*}}, i32 {{.*}}, i32 {{.*}})
		%r23 = rocdl.mfma.f32.16x16x32.fp8.fp8 %arg11, %arg11, %arg5, %csti32, %csti32, %csti32 :
		(i64, i64, vector<4xf32>,
		i32, i32, i32) -> vector<4xf32>

		// CHECK: call <16 x float> @llvm.amdgcn.mfma.f32.32x32x16.bf8.bf8(i64 %{{.*}}, i64 %{{.*}}, <16 x float> %{{.*}}, i32 {{.*}}, i32 {{.*}}, i32 {{.*}})
		%r24 = rocdl.mfma.f32.32x32x16.bf8.bf8 %arg11, %arg11, %arg4, %csti32, %csti32, %csti32 :
		(i64, i64, vector<16xf32>,
		i32, i32, i32) -> vector<16xf32>

		// CHECK: call <16 x float> @llvm.amdgcn.mfma.f32.32x32x16.bf8.fp8(i64 %{{.*}}, i64 %{{.*}}, <16 x float> %{{.*}}, i32 {{.*}}, i32 {{.*}}, i32 {{.*}})
		%r25 = rocdl.mfma.f32.32x32x16.bf8.fp8 %arg11, %arg11, %arg4, %csti32, %csti32, %csti32 :
		(i64, i64, vector<16xf32>,
		i32, i32, i32) -> vector<16xf32>

		// CHECK: call <16 x float> @llvm.amdgcn.mfma.f32.32x32x16.fp8.bf8(i64 %{{.*}}, i64 %{{.*}}, <16 x float> %{{.*}}, i32 {{.*}}, i32 {{.*}}, i32 {{.*}})
		%r26 = rocdl.mfma.f32.32x32x16.fp8.bf8 %arg11, %arg11, %arg4, %csti32, %csti32, %csti32 :
		(i64, i64, vector<16xf32>,
		i32, i32, i32) -> vector<16xf32>

		// CHECK: call <16 x float> @llvm.amdgcn.mfma.f32.32x32x16.fp8.fp8(i64 %{{.*}}, i64 %{{.*}}, <16 x float> %{{.*}}, i32 {{.*}}, i32 {{.*}}, i32 {{.*}})
		%r27 = rocdl.mfma.f32.32x32x16.fp8.fp8 %arg11, %arg11, %arg4, %csti32, %csti32, %csti32 :
		(i64, i64, vector<16xf32>,
		i32, i32, i32) -> vector<16xf32>
llvm.return %r0 : vector<32 x f32>		llvm.return %r0 : vector<32 x f32>
}		}
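
In the f8 cases above, each source operand reaches the intrinsic as a single `i64` rather than as a vector of eight 8-bit values. The C++ sketch below shows one way eight raw f8 bit patterns can be packed into a 64-bit integer. It assumes a little-endian layout with lane 0 in the least significant byte; that ordering and the name `packF8Lanes` are assumptions made for illustration, not something this diff states.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Packs eight raw f8 bit patterns into one 64-bit integer.
// Assumption: lane 0 occupies the least significant byte.
inline uint64_t packF8Lanes(const std::array<uint8_t, 8> &lanes) {
  uint64_t packed = 0;
  for (std::size_t i = 0; i < lanes.size(); ++i)
    packed |= static_cast<uint64_t>(lanes[i]) << (8 * i);
  return packed;
}
```

The function only moves bits; it does not interpret them, so the same packing applies to both f8E5M2FNUZ and f8E4M3FNUZ payloads.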

llvm.func @rocdl.mubuf(%rsrc : vector<4xi32>, %vindex : i32,		llvm.func @rocdl.mubuf(%rsrc : vector<4xi32>, %vindex : i32,
%offset : i32, %vdata1 : vector<1xf32>,		%offset : i32, %vdata1 : vector<1xf32>,
%vdata2 : vector<2xf32>, %vdata4 : vector<4xf32>) {		%vdata2 : vector<2xf32>, %vdata4 : vector<4xf32>) {
%glc = llvm.mlir.constant(false) : i1		%glc = llvm.mlir.constant(false) : i1
%slc = llvm.mlir.constant(true) : i1		%slc = llvm.mlir.constant(true) : i1

Revision Contents

Diff 497696

mlir/include/mlir/Dialect/AMDGPU/AMDGPU.td

mlir/include/mlir/Dialect/LLVMIR/ROCDLOps.td

mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp

mlir/lib/Dialect/AMDGPU/IR/AMDGPUDialect.cpp

mlir/test/Conversion/AMDGPUToROCDL/amdgpu-to-rocdl.mlir

mlir/test/Conversion/AMDGPUToROCDL/mfma.mlir

mlir/test/Dialect/AMDGPU/invalid.mlir

mlir/test/Dialect/AMDGPU/ops.mlir

mlir/test/Target/LLVMIR/rocdl.mlir
