This is an archive of the discontinued LLVM Phabricator instance.

Differential D152451

[mlir][AMDGPU] Define wrappers for WMMA matrix ops
Closed · Public

Authored by giuseros on Jun 8 2023, 8:42 AM.

Details

Reviewers

ftynse
dcaballe
nicolasvasilache
herhut
krzysz00

Commits

rG4b3eaee2701a: [mlir][AMDGPU] Define wrappers for WMMA matrix ops

Summary

Wave Matrix Multiply Accumulate (WMMA) is the instruction to accelerate
matrix multiplication on RDNA3 architectures. LLVM already provides a
set of intrinsics to generate wmma instructions. This change uses those
intrinsics to enable the feature in MLIR.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

giuseros created this revision. (Jun 8 2023, 8:42 AM)

Herald added a reviewer: ftynse. (Jun 8 2023, 8:42 AM)

Herald added a reviewer: dcaballe.

Herald added a project: Restricted Project.

Herald added subscribers: bviyer, Moerafaat, zero9178 and 29 others.

giuseros requested review of this revision. (Jun 8 2023, 8:42 AM)

Herald added a reviewer: nicolasvasilache. (Jun 8 2023, 8:42 AM)

Herald added a reviewer: herhut.

Herald added a project: Restricted Project.

Herald added subscribers: stephenneuendorffer, nicolasvasilache, wdng.

Harbormaster completed remote builds in B237506: Diff 529610. (Jun 8 2023, 8:58 AM)

giuseros added a reviewer: krzysz00. (Jun 19 2023, 6:16 AM)

Great to see this is coming in.

mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp
368	(jungpark-mlir) I expect there's less chance of the flag being misused than of the input type being set incorrectly at some point along the lowering path. I'd feel slightly better having unsignedA/B be optional and ignoring the elemType when unsignedA/B is given, but no big deal.
544	(jungpark-mlir) Does it check sourceA.type == sourceB.type?

Herald added subscribers: gysit, Dinistro. (Jun 23 2023, 8:54 AM)

@jungpark-mlir Ping.

mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp
544	(krzysz00) Yeah, we have an `AllTypesMatch<["sourceA", "sourceB"]>` on the op definition.

Approving since I don't think there's any good reason not to merge this.

This revision is now accepted and ready to land. (Jul 10 2023, 1:14 PM)

Closed by commit rG4b3eaee2701a: [mlir][AMDGPU] Define wrappers for WMMA matrix ops (authored by giuseros, committed by krzysz00). (Jul 20 2023, 11:38 AM)

This revision was automatically updated to reflect the committed changes.

krzysz00 added a commit: rG4b3eaee2701a: [mlir][AMDGPU] Define wrappers for WMMA matrix ops.

Revision Contents

Path	Size
mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPU.td	41 lines
mlir/include/mlir/Dialect/LLVMIR/ROCDLOps.td	21 lines
mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp	130 lines
mlir/lib/Dialect/AMDGPU/IR/AMDGPUDialect.cpp	28 lines
mlir/test/Conversion/AMDGPUToROCDL/wmma.mlir	27 lines
mlir/test/Dialect/AMDGPU/invalid.mlir	8 lines
mlir/test/Dialect/AMDGPU/ops.mlir	7 lines
mlir/test/Target/LLVMIR/rocdl.mlir	60 lines

Diff 542611

mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPU.td

[374 lines not shown]
def MFMAInTypes : AnyTypeOf<[F32, F64, I32, I64,
VectorOfLengthAndType<[4], [F16]>,		VectorOfLengthAndType<[4], [F16]>,
VectorOfLengthAndType<[2, 4], [BF16]>,		VectorOfLengthAndType<[2, 4], [BF16]>,
VectorOfLengthAndType<[4, 8], [I8]>,		VectorOfLengthAndType<[4, 8], [I8]>,
VectorOfLengthAndType<[8], [F8E5M2FNUZ, F8E4M3FNUZ]>]>;		VectorOfLengthAndType<[8], [F8E5M2FNUZ, F8E4M3FNUZ]>]>;
def MFMAOutTypes : AnyTypeOf<[F64,		def MFMAOutTypes : AnyTypeOf<[F64,
VectorOfLengthAndType<[4, 16, 32], [F32]>,		VectorOfLengthAndType<[4, 16, 32], [F32]>,
VectorOfLengthAndType<[4, 16, 32], [I32]>,		VectorOfLengthAndType<[4, 16, 32], [I32]>,
VectorOfLengthAndType<[4], [F64]>]>;		VectorOfLengthAndType<[4], [F64]>]>;
		// wmma
		def WMMAInTypes : AnyTypeOf<[VectorOfLengthAndType<[16], [F16, BF16, I8, SI8, UI8]>]>;
		def WMMAOutTypes : AnyTypeOf<[VectorOfLengthAndType<[4, 8], [F32, I32]>,
		VectorOfLengthAndType<[8, 16], [F16, BF16]>]>;

def AMDGPU_MFMAOp :		def AMDGPU_MFMAOp :
AMDGPU_Op<"mfma", [AllTypesMatch<["destC", "destD"]>,		AMDGPU_Op<"mfma", [AllTypesMatch<["destC", "destD"]>,
Pure]>,		Pure]>,
Arguments<(ins		Arguments<(ins
I32Attr:$m,		I32Attr:$m,
I32Attr:$n,		I32Attr:$n,
I32Attr:$k,		I32Attr:$k,
[42 lines not shown]
let assemblyFormat = [{
$sourceA `*` $sourceB `+` $destC		$sourceA `*` $sourceB `+` $destC
attr-dict		attr-dict
`blgp` `=` $blgp		`blgp` `=` $blgp
`:` type($sourceA) `,` type($sourceB) `,` type($destC)		`:` type($sourceA) `,` type($sourceB) `,` type($destC)
}];		}];
let hasVerifier = 1;		let hasVerifier = 1;
}		}

		def AMDGPU_WMMAOp :
		AMDGPU_Op<"wmma", [AllTypesMatch<["destC", "destD"]>,
		AllTypesMatch<["sourceA", "sourceB"]>,
		Pure]>,
		Arguments<(ins
		WMMAInTypes:$sourceA,
		WMMAInTypes:$sourceB,
		WMMAOutTypes:$destC,
		DefaultValuedAttr<ConfinedAttr<I32Attr, [IntMinValue<0>, IntMaxValue<1>]>, "0">:$subwordOffset,
		UnitAttr:$unsignedA,
		UnitAttr:$unsignedB,
		UnitAttr:$clamp)>,
		Results<(outs WMMAOutTypes: $destD)> {
		let summary = "MLIR wrapper for RDNA3 wmma instructions";
		let description = [{
		The `amdgpu.wmma` op is an MLIR wrapper around intrinsics
		for various `wmma` instructions in the RDNA3 architecture, which perform
		a 16x16 matrix multiplication for different data types.

		When emitting f16->f16 (or bf16->bf16) wmma the output is a 16xf16 (or 16xbf16) vector
		containing only 8 valid values:
		- If `subwordOffset` is 0, then the output is stored at indices 0, 2, 4, ..., 14.
		- If `subwordOffset` is 1, then the output is stored at indices 1, 3, 5, ..., 15.

		`unsignedA` and `unsignedB` flag that the `int8` LLVM inputs are unsigned.

		The `clamp` flag is used to saturate the output of type T to numeric_limits<T>::max()
		in case of overflow.
		}];
		let assemblyFormat = [{
		$sourceA `*` $sourceB `+` $destC
		attr-dict
		`:` type($sourceA) `,` type($sourceB) `,` type($destC)
		}];
		let hasVerifier = 1;
		}

#endif // AMDGPU		#endif // AMDGPU
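
The lane layout described in the `amdgpu.wmma` op documentation above can be sketched in plain C++. This is a minimal illustration and not code from the patch: `validOutputIndices` is a hypothetical helper that lists which of the 16 result elements hold valid values for a given `subwordOffset`, assuming only what the description states (offset 0 selects the even indices, offset 1 the odd ones).

```cpp
#include <array>
#include <cassert>

// Hypothetical helper, not part of the patch: returns the 8 positions of the
// 16-element f16/bf16 result vector that carry valid values.
std::array<int, 8> validOutputIndices(int subwordOffset) {
  // The op definition confines subwordOffset to the range [0, 1].
  assert(subwordOffset == 0 || subwordOffset == 1);
  std::array<int, 8> indices{};
  for (int i = 0; i < 8; ++i)
    indices[i] = 2 * i + subwordOffset; // 0, 2, ..., 14 or 1, 3, ..., 15
  return indices;
}
```

A consumer of the result would read only these positions and treat the other eight elements as unspecified.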

mlir/include/mlir/Dialect/LLVMIR/ROCDLOps.td

[118 lines not shown]
builder.CreateFence(llvm::AtomicOrdering::Release,
llvmContext.getOrInsertSyncScopeID("workgroup"));		llvmContext.getOrInsertSyncScopeID("workgroup"));
createIntrinsicCall(builder, llvm::Intrinsic::amdgcn_s_barrier);		createIntrinsicCall(builder, llvm::Intrinsic::amdgcn_s_barrier);
builder.CreateFence(llvm::AtomicOrdering::Acquire,		builder.CreateFence(llvm::AtomicOrdering::Acquire,
llvmContext.getOrInsertSyncScopeID("workgroup"));		llvmContext.getOrInsertSyncScopeID("workgroup"));
}];		}];
let assemblyFormat = "attr-dict";		let assemblyFormat = "attr-dict";
}		}


//===---------------------------------------------------------------------===//		//===---------------------------------------------------------------------===//
// Xdlops intrinsics		// Xdlops intrinsics

class ROCDL_Mfma_IntrOp<string mnemonic, list<Trait> traits = []> :		class ROCDL_Mfma_IntrOp<string mnemonic, list<Trait> traits = []> :
LLVM_IntrOpBase<ROCDL_Dialect, mnemonic,		LLVM_IntrOpBase<ROCDL_Dialect, mnemonic,
"amdgcn_" # !subst(".","_", mnemonic),		"amdgcn_" # !subst(".","_", mnemonic),
[], [], traits, 1>,		[], [], traits, 1>,
Arguments<(ins Variadic<LLVM_Type>:$args)> {		Arguments<(ins Variadic<LLVM_Type>:$args)> {
[43 lines not shown]
def ROCDL_mfma_f32_16x16x32_fp8_bf8 : ROCDL_Mfma_IntrOp<"mfma.f32.16x16x32.fp8.bf8">;		def ROCDL_mfma_f32_16x16x32_fp8_bf8 : ROCDL_Mfma_IntrOp<"mfma.f32.16x16x32.fp8.bf8">;
def ROCDL_mfma_f32_16x16x32_fp8_fp8 : ROCDL_Mfma_IntrOp<"mfma.f32.16x16x32.fp8.fp8">;		def ROCDL_mfma_f32_16x16x32_fp8_fp8 : ROCDL_Mfma_IntrOp<"mfma.f32.16x16x32.fp8.fp8">;
def ROCDL_mfma_f32_32x32x16_bf8_bf8 : ROCDL_Mfma_IntrOp<"mfma.f32.32x32x16.bf8.bf8">;		def ROCDL_mfma_f32_32x32x16_bf8_bf8 : ROCDL_Mfma_IntrOp<"mfma.f32.32x32x16.bf8.bf8">;
def ROCDL_mfma_f32_32x32x16_bf8_fp8 : ROCDL_Mfma_IntrOp<"mfma.f32.32x32x16.bf8.fp8">;		def ROCDL_mfma_f32_32x32x16_bf8_fp8 : ROCDL_Mfma_IntrOp<"mfma.f32.32x32x16.bf8.fp8">;
def ROCDL_mfma_f32_32x32x16_fp8_bf8 : ROCDL_Mfma_IntrOp<"mfma.f32.32x32x16.fp8.bf8">;		def ROCDL_mfma_f32_32x32x16_fp8_bf8 : ROCDL_Mfma_IntrOp<"mfma.f32.32x32x16.fp8.bf8">;
def ROCDL_mfma_f32_32x32x16_fp8_fp8 : ROCDL_Mfma_IntrOp<"mfma.f32.32x32x16.fp8.fp8">;		def ROCDL_mfma_f32_32x32x16_fp8_fp8 : ROCDL_Mfma_IntrOp<"mfma.f32.32x32x16.fp8.fp8">;

//===---------------------------------------------------------------------===//		//===---------------------------------------------------------------------===//
		// WMMA intrinsics
		class ROCDL_Wmma_IntrOp<string mnemonic, list<Trait> traits = []> :
		LLVM_IntrOpBase<ROCDL_Dialect, mnemonic,
		"amdgcn_" # !subst(".","_", mnemonic),
		[0], [], traits, 1>,
		Arguments<(ins Variadic<LLVM_Type>:$args)> {
		let assemblyFormat =
		"$args attr-dict `:` functional-type($args, $res)";
		}

		// Available on RDNA3
		def ROCDL_wmma_f32_16x16x16_f16 : ROCDL_Wmma_IntrOp<"wmma.f32.16x16x16.f16">;
		def ROCDL_wmma_f32_16x16x16_bf16 : ROCDL_Wmma_IntrOp<"wmma.f32.16x16x16.bf16">;
		def ROCDL_wmma_f16_16x16x16_f16 : ROCDL_Wmma_IntrOp<"wmma.f16.16x16x16.f16">;
		def ROCDL_wmma_bf16_16x16x16_bf16 : ROCDL_Wmma_IntrOp<"wmma.bf16.16x16x16.bf16">;
		def ROCDL_wmma_i32_16x16x16_iu8 : ROCDL_Wmma_IntrOp<"wmma.i32.16x16x16.iu8">;
		def ROCDL_wmma_i32_16x16x16_iu4 : ROCDL_Wmma_IntrOp<"wmma.i32.16x16x16.iu4">;


		//===---------------------------------------------------------------------===//
// Vector buffer load/store intrinsics		// Vector buffer load/store intrinsics

def ROCDL_MubufLoadOp :		def ROCDL_MubufLoadOp :
ROCDL_Op<"buffer.load">,		ROCDL_Op<"buffer.load">,
Results<(outs LLVM_Type:$res)>,		Results<(outs LLVM_Type:$res)>,
Arguments<(ins LLVM_Type:$rsrc,		Arguments<(ins LLVM_Type:$rsrc,
LLVM_Type:$vindex,		LLVM_Type:$vindex,
LLVM_Type:$offset,		LLVM_Type:$offset,
[157 lines not shown]

mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp

[26 lines not shown]
using namespace mlir::amdgpu;		using namespace mlir::amdgpu;

static Value createI32Constant(ConversionPatternRewriter &rewriter,		static Value createI32Constant(ConversionPatternRewriter &rewriter,
Location loc, int32_t value) {		Location loc, int32_t value) {
Type llvmI32 = rewriter.getI32Type();		Type llvmI32 = rewriter.getI32Type();
return rewriter.create<LLVM::ConstantOp>(loc, llvmI32, value);		return rewriter.create<LLVM::ConstantOp>(loc, llvmI32, value);
}		}

		static Value createI1Constant(ConversionPatternRewriter &rewriter, Location loc,
		bool value) {
		Type llvmI1 = rewriter.getI1Type();
		return rewriter.createOrFold<LLVM::ConstantOp>(loc, llvmI1, value);
		}

namespace {		namespace {
/// Define lowering patterns for raw buffer ops		/// Define lowering patterns for raw buffer ops
template <typename GpuOp, typename Intrinsic>		template <typename GpuOp, typename Intrinsic>
struct RawBufferOpLowering : public ConvertOpToLLVMPattern<GpuOp> {		struct RawBufferOpLowering : public ConvertOpToLLVMPattern<GpuOp> {
RawBufferOpLowering(LLVMTypeConverter &converter, Chipset chipset)		RawBufferOpLowering(LLVMTypeConverter &converter, Chipset chipset)
: ConvertOpToLLVMPattern<GpuOp>(converter), chipset(chipset) {}		: ConvertOpToLLVMPattern<GpuOp>(converter), chipset(chipset) {}

Chipset chipset;		Chipset chipset;
[286 lines not shown]
for (int64_t i = 0; i < numBytes; ++i) {
Value shifted = rewriter.create<LLVM::ShlOp>(loc, extended, shiftConst);		Value shifted = rewriter.create<LLVM::ShlOp>(loc, extended, shiftConst);
result = rewriter.create<LLVM::OrOp>(loc, result, shifted);		result = rewriter.create<LLVM::OrOp>(loc, result, shifted);
}		}
return result;		return result;
}		}
return input;		return input;
}		}

		/// Push an input operand. If it is a float type, nothing to do. If it is
		/// an integer type, then we need to also push its signedness (1 for signed, 0
		/// for unsigned) and we need to pack the input 16xi8 vector into a 4xi32
		/// vector.
		static void wmmaPushInputOperand(ConversionPatternRewriter &rewriter,
		Location loc, TypeConverter *typeConverter,
		bool isUnsigned, Value llvmInput,
		SmallVector<Value, 4> &operands) {
		Type inputType = llvmInput.getType();
		auto vectorType = inputType.dyn_cast<VectorType>();
		Type elemType = vectorType.getElementType();

		if (!elemType.isInteger(8)) {
		operands.push_back(llvmInput);
		return;
		}

		int64_t numBytes = vectorType.getNumElements();
		Type i32 = rewriter.getI32Type();
		VectorType vectorType32bits = VectorType::get(numBytes * 8 / 32, i32);
		auto llvmVectorType32bits = typeConverter->convertType(vectorType32bits);

		Value result = rewriter.createOrFold<LLVM::BitcastOp>(
		loc, llvmVectorType32bits, llvmInput);

		// if element type is 8-bit signed or unsigned, ignore the isUnsigned flag
		bool localIsUnsigned = isUnsigned;
		if (elemType.isUnsignedInteger(8)) {
		localIsUnsigned = true;
		} else if (elemType.isSignedInteger(8)) {
		localIsUnsigned = false;
		}
		Value sign = createI1Constant(rewriter, loc, !localIsUnsigned);
		operands.push_back(sign);
		operands.push_back(result);
		}
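
The two decisions `wmmaPushInputOperand` makes for 8-bit integer inputs can be sketched without MLIR. This is an illustration under stated assumptions, not the patch's code: `Int8Kind`, `signOperand`, `packedWordCount` and `packBytes` are hypothetical names, and `std::memcpy` stands in for the LLVM bitcast from 16 x i8 to 4 x i32 (the two should agree on a little-endian target, which is assumed here).

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for the three 8-bit integer flavours (i8, si8, ui8).
enum class Int8Kind { Signless, Signed, Unsigned };

// Value of the i1 operand pushed before the packed vector: true means signed.
// An explicitly signed or unsigned element type overrides the flag.
bool signOperand(Int8Kind kind, bool isUnsignedFlag) {
  bool localIsUnsigned = isUnsignedFlag;
  if (kind == Int8Kind::Unsigned)
    localIsUnsigned = true;
  else if (kind == Int8Kind::Signed)
    localIsUnsigned = false;
  return !localIsUnsigned;
}

// Number of 32-bit words needed for numBytes bytes (numBytes * 8 / 32).
std::int64_t packedWordCount(std::int64_t numBytes) {
  assert(numBytes % 4 == 0 && "byte count must fill whole 32-bit words");
  return numBytes * 8 / 32;
}

// Reinterprets 16 bytes as 4 words, mimicking a bitcast of the whole vector.
std::array<std::uint32_t, 4>
packBytes(const std::array<std::uint8_t, 16> &bytes) {
  static_assert(sizeof(std::uint32_t) * 4 == sizeof(std::uint8_t) * 16,
                "both views must cover 16 bytes");
  std::array<std::uint32_t, 4> words{};
  std::memcpy(words.data(), bytes.data(), 16);
  return words;
}
```

With a signless `i8` input the flag alone decides signedness, which is the case the review comment on this function is concerned with.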

		/// Push the output operand. In many cases this only pushes the output onto
		/// the operand list. But for the f16 -> f16 and bf16 -> bf16 intrinsics,
		/// since the same number of VGPRs is used, we need to decide whether to store
		/// the result in the upper 16 bits of the VGPRs or in the lower part. To store
		/// the result in the lower 16 bits, set subwordOffset to 1; otherwise the
		/// result will be stored in the upper part.
		static void wmmaPushOutputOperand(ConversionPatternRewriter &rewriter,
		Location loc, TypeConverter *typeConverter,
		Value output, int32_t subwordOffset,
		bool clamp, SmallVector<Value, 4> &operands) {
		Type inputType = output.getType();
		auto vectorType = inputType.dyn_cast<VectorType>();
		Type elemType = vectorType.getElementType();
		operands.push_back(output);
		if (elemType.isF16() || elemType.isBF16()) {
		operands.push_back(createI1Constant(rewriter, loc, subwordOffset));
		} else if (elemType.isInteger(32)) {
		operands.push_back(createI1Constant(rewriter, loc, clamp));
		}
		}
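
The choice of trailing `i1` operand in `wmmaPushOutputOperand` can be summarized as a small decision function. This is a sketch only: `OutElem` and `trailingFlag` are hypothetical names introduced for illustration, and the function models just the branch structure shown above, not the operand list itself.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical stand-in for the destination element types WMMAOutTypes allows.
enum class OutElem { F32, I32, F16, BF16 };

// Returns the value of the extra i1 operand appended after the accumulator,
// or std::nullopt when no extra operand is pushed (the f32 case).
std::optional<bool> trailingFlag(OutElem elem, std::int32_t subwordOffset,
                                 bool clamp) {
  assert(subwordOffset == 0 || subwordOffset == 1);
  if (elem == OutElem::F16 || elem == OutElem::BF16)
    return subwordOffset != 0; // subwordOffset narrowed to i1
  if (elem == OutElem::I32)
    return clamp;
  return std::nullopt;
}
```

This matches the operand shapes in the lowering test: the f32 variants take three operands, while the f16, bf16 and i32 variants end with one `i1`.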

/// Return the `rocdl` intrinsic corresponding to a MFMA operation `mfma`		/// Return the `rocdl` intrinsic corresponding to a MFMA operation `mfma`
/// if one exists. This includes checking to ensure the intrinsic is supported		/// if one exists. This includes checking to ensure the intrinsic is supported
/// on the architecture you are compiling for.		/// on the architecture you are compiling for.
static std::optional<StringRef> mfmaOpToIntrinsic(MFMAOp mfma,		static std::optional<StringRef> mfmaOpToIntrinsic(MFMAOp mfma,
Chipset chipset) {		Chipset chipset) {
uint32_t m = mfma.getM(), n = mfma.getN(), k = mfma.getK(),		uint32_t m = mfma.getM(), n = mfma.getN(), k = mfma.getK(),
b = mfma.getBlocks();		b = mfma.getBlocks();
Type sourceElem = mfma.getSourceA().getType();		Type sourceElem = mfma.getSourceA().getType();
[121 lines not shown]
if (m == 32 && n == 32 && k == 16 && b == 1) {
if (sourceBElem.isFloat8E4M3FNUZ())		if (sourceBElem.isFloat8E4M3FNUZ())
return ROCDL::mfma_f32_32x32x16_fp8_fp8::getOperationName();		return ROCDL::mfma_f32_32x32x16_fp8_fp8::getOperationName();
}		}
}		}

return std::nullopt;		return std::nullopt;
}		}

		/// Return the `rocdl` intrinsic corresponding to a WMMA operation `wmma`
		/// if one exists. This includes checking to ensure the intrinsic is supported
		/// on the architecture you are compiling for.
		static std::optional<StringRef> wmmaOpToIntrinsic(WMMAOp wmma,
		Chipset chipset) {

		auto sourceVectorType = wmma.getSourceA().getType().dyn_cast<VectorType>();
		auto destVectorType = wmma.getDestC().getType().dyn_cast<VectorType>();
		auto elemSourceType = sourceVectorType.getElementType();
		auto elemDestType = destVectorType.getElementType();

		if (elemSourceType.isF16() && elemDestType.isF32()) {
		return ROCDL::wmma_f32_16x16x16_f16::getOperationName();
		} else if (elemSourceType.isBF16() && elemDestType.isF32()) {
		return ROCDL::wmma_f32_16x16x16_bf16::getOperationName();
		} else if (elemSourceType.isF16() && elemDestType.isF16()) {
		return ROCDL::wmma_f16_16x16x16_f16::getOperationName();
		} else if (elemSourceType.isBF16() && elemDestType.isBF16()) {
		return ROCDL::wmma_bf16_16x16x16_bf16::getOperationName();
		} else if (elemSourceType.isInteger(8) && elemDestType.isInteger(32)) {
		return ROCDL::wmma_i32_16x16x16_iu8::getOperationName();
		}
		return std::nullopt;
		}
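
The selection in `wmmaOpToIntrinsic` amounts to a lookup keyed on the source and destination element types. The sketch below restates it as a standalone function; `Elem`, `wmmaIntrinsicName` and `wmmaIntrinsicNameExample` are hypothetical names, and the returned strings are the op names that appear in the lowering test's CHECK lines.

```cpp
#include <cassert>
#include <optional>
#include <string_view>

// Hypothetical stand-in for the element types involved in the selection.
enum class Elem { F16, BF16, I8, F32, I32 };

// Maps (source element, destination element) to a ROCDL op name, or
// std::nullopt when the pair has no matching intrinsic.
std::optional<std::string_view> wmmaIntrinsicName(Elem src, Elem dst) {
  if (src == Elem::F16 && dst == Elem::F32)
    return std::string_view("rocdl.wmma.f32.16x16x16.f16");
  if (src == Elem::BF16 && dst == Elem::F32)
    return std::string_view("rocdl.wmma.f32.16x16x16.bf16");
  if (src == Elem::F16 && dst == Elem::F16)
    return std::string_view("rocdl.wmma.f16.16x16x16.f16");
  if (src == Elem::BF16 && dst == Elem::BF16)
    return std::string_view("rocdl.wmma.bf16.16x16x16.bf16");
  if (src == Elem::I8 && dst == Elem::I32)
    return std::string_view("rocdl.wmma.i32.16x16x16.iu8");
  return std::nullopt;
}

// Usage sketch: a supported pair yields a name, an unsupported one does not.
inline void wmmaIntrinsicNameExample() {
  assert(wmmaIntrinsicName(Elem::F16, Elem::F32).has_value());
  assert(!wmmaIntrinsicName(Elem::I8, Elem::F32).has_value());
}
```

Note that the `iu4` intrinsic defined in ROCDLOps.td is not reachable from this selection in the diff as shown.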

namespace {		namespace {
struct MFMAOpLowering : public ConvertOpToLLVMPattern<MFMAOp> {		struct MFMAOpLowering : public ConvertOpToLLVMPattern<MFMAOp> {
MFMAOpLowering(LLVMTypeConverter &converter, Chipset chipset)		MFMAOpLowering(LLVMTypeConverter &converter, Chipset chipset)
: ConvertOpToLLVMPattern<MFMAOp>(converter), chipset(chipset) {}		: ConvertOpToLLVMPattern<MFMAOp>(converter), chipset(chipset) {}

Chipset chipset;		Chipset chipset;

LogicalResult		LogicalResult
[23 lines not shown]
loweredOp.addOperands(
createI32Constant(rewriter, loc, op.getAbid()),		createI32Constant(rewriter, loc, op.getAbid()),
createI32Constant(rewriter, loc, getBlgpField)});		createI32Constant(rewriter, loc, getBlgpField)});
Operation *lowered = rewriter.create(loweredOp);		Operation *lowered = rewriter.create(loweredOp);
rewriter.replaceOp(op, lowered->getResults());		rewriter.replaceOp(op, lowered->getResults());
return success();		return success();
}		}
};		};

		struct WMMAOpLowering : public ConvertOpToLLVMPattern<WMMAOp> {
		WMMAOpLowering(LLVMTypeConverter &converter, Chipset chipset)
		: ConvertOpToLLVMPattern<WMMAOp>(converter), chipset(chipset) {}

		Chipset chipset;

		LogicalResult
		matchAndRewrite(WMMAOp op, WMMAOpAdaptor adaptor,
		ConversionPatternRewriter &rewriter) const override {
		Location loc = op.getLoc();
		Type outType = typeConverter->convertType(op.getDestD().getType());

		if (chipset.majorVersion != 11)
		return op->emitOpError("WMMA only supported on gfx11");

		std::optional<StringRef> maybeIntrinsic = wmmaOpToIntrinsic(op, chipset);

		if (!maybeIntrinsic.has_value())
		return op.emitOpError("no intrinsic matching WMMA on the given chipset");

		OperationState loweredOp(loc, *maybeIntrinsic);
		loweredOp.addTypes(outType);

		SmallVector<Value, 4> operands;
		wmmaPushInputOperand(rewriter, loc, typeConverter, op.getUnsignedA(),
		adaptor.getSourceA(), operands);
		wmmaPushInputOperand(rewriter, loc, typeConverter, op.getUnsignedB(),
		adaptor.getSourceB(), operands);
		wmmaPushOutputOperand(rewriter, loc, typeConverter, adaptor.getDestC(),
		op.getSubwordOffset(), op.getClamp(), operands);

		loweredOp.addOperands(operands);
		Operation *lowered = rewriter.create(loweredOp);
		rewriter.replaceOp(op, lowered->getResults());

		return success();
		}
		};

struct ConvertAMDGPUToROCDLPass		struct ConvertAMDGPUToROCDLPass
: public impl::ConvertAMDGPUToROCDLBase<ConvertAMDGPUToROCDLPass> {		: public impl::ConvertAMDGPUToROCDLBase<ConvertAMDGPUToROCDLPass> {
ConvertAMDGPUToROCDLPass() = default;		ConvertAMDGPUToROCDLPass() = default;

void runOnOperation() override {		void runOnOperation() override {
MLIRContext *ctx = &getContext();		MLIRContext *ctx = &getContext();
FailureOr<Chipset> maybeChipset = Chipset::parse(chipset);		FailureOr<Chipset> maybeChipset = Chipset::parse(chipset);
if (failed(maybeChipset)) {		if (failed(maybeChipset)) {
[23 lines not shown]
patterns.add<
RawBufferOpLowering<RawBufferLoadOp, ROCDL::RawBufferLoadOp>,		RawBufferOpLowering<RawBufferLoadOp, ROCDL::RawBufferLoadOp>,
RawBufferOpLowering<RawBufferStoreOp, ROCDL::RawBufferStoreOp>,		RawBufferOpLowering<RawBufferStoreOp, ROCDL::RawBufferStoreOp>,
RawBufferOpLowering<RawBufferAtomicFaddOp, ROCDL::RawBufferAtomicFAddOp>,		RawBufferOpLowering<RawBufferAtomicFaddOp, ROCDL::RawBufferAtomicFAddOp>,
RawBufferOpLowering<RawBufferAtomicFmaxOp, ROCDL::RawBufferAtomicFMaxOp>,		RawBufferOpLowering<RawBufferAtomicFmaxOp, ROCDL::RawBufferAtomicFMaxOp>,
RawBufferOpLowering<RawBufferAtomicSmaxOp, ROCDL::RawBufferAtomicSMaxOp>,		RawBufferOpLowering<RawBufferAtomicSmaxOp, ROCDL::RawBufferAtomicSMaxOp>,
RawBufferOpLowering<RawBufferAtomicUminOp, ROCDL::RawBufferAtomicUMinOp>,		RawBufferOpLowering<RawBufferAtomicUminOp, ROCDL::RawBufferAtomicUMinOp>,
RawBufferOpLowering<RawBufferAtomicCmpswapOp,		RawBufferOpLowering<RawBufferAtomicCmpswapOp,
ROCDL::RawBufferAtomicCmpSwap>,		ROCDL::RawBufferAtomicCmpSwap>,
MFMAOpLowering>(converter, chipset);		MFMAOpLowering, WMMAOpLowering>(converter, chipset);
}		}

std::unique_ptr<Pass> mlir::createConvertAMDGPUToROCDLPass() {		std::unique_ptr<Pass> mlir::createConvertAMDGPUToROCDLPass() {
return std::make_unique<ConvertAMDGPUToROCDLPass>();		return std::make_unique<ConvertAMDGPUToROCDLPass>();
}		}

mlir/lib/Dialect/AMDGPU/IR/AMDGPUDialect.cpp

	[200 lines not shown]

	void RawBufferAtomicCmpswapOp::getCanonicalizationPatterns(			void RawBufferAtomicCmpswapOp::getCanonicalizationPatterns(
	RewritePatternSet &results, MLIRContext *context) {			RewritePatternSet &results, MLIRContext *context) {
	results.add<RemoveStaticallyOobBufferLoads<RawBufferAtomicCmpswapOp>>(			results.add<RemoveStaticallyOobBufferLoads<RawBufferAtomicCmpswapOp>>(
	context);			context);
	}			}

	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
				// WMMAOp
				//===----------------------------------------------------------------------===//
				LogicalResult WMMAOp::verify() {
				Type sourceAType = getSourceA().getType();
				Type destType = getDestC().getType();

				VectorType sourceVectorAType = sourceAType.dyn_cast<VectorType>();
				VectorType destVectorType = destType.dyn_cast<VectorType>();

				Type sourceAElemType = sourceVectorAType.getElementType();
				Type destElemType = destVectorType.getElementType();

				bool isDestFloat =
				(destElemType.isF32() || destElemType.isF16() || destElemType.isBF16());
				bool isSrcFloat = (sourceAElemType.isF16() || sourceAElemType.isBF16());

				if (isDestFloat && !isSrcFloat) {
				return emitOpError("Expected float sources with float destination");
				}

				if (!isDestFloat && isSrcFloat) {
				return emitOpError("Expected int sources with int destination");
				}

				return success();
				}
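
The check performed by `WMMAOp::verify()` reduces to requiring that sources and destination are both float or both integer. A minimal sketch, with `wmmaTypeError` and `wmmaTypeErrorExample` as hypothetical names and the two booleans standing in for the element-type queries:

```cpp
#include <cassert>
#include <optional>
#include <string_view>

// Returns the diagnostic the verifier would emit, or std::nullopt on success.
std::optional<std::string_view> wmmaTypeError(bool isSrcFloat,
                                              bool isDestFloat) {
  if (isDestFloat && !isSrcFloat)
    return std::string_view("Expected float sources with float destination");
  if (!isDestFloat && isSrcFloat)
    return std::string_view("Expected int sources with int destination");
  return std::nullopt;
}

// Usage sketch: f16 sources with an i32 destination are rejected, which is
// the case exercised in invalid.mlir.
inline void wmmaTypeErrorExample() {
  assert(wmmaTypeError(/*isSrcFloat=*/true, /*isDestFloat=*/false).has_value());
}
```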

				//===----------------------------------------------------------------------===//
	// MFMAOp			// MFMAOp
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	LogicalResult MFMAOp::verify() {			LogicalResult MFMAOp::verify() {
	constexpr uint32_t waveSize = 64;			constexpr uint32_t waveSize = 64;
	Builder b(getContext());			Builder b(getContext());

	Type sourceType = getSourceA().getType();			Type sourceType = getSourceA().getType();
	Type destType = getDestC().getType();			Type destType = getDestC().getType();
	[76 lines not shown]

mlir/test/Conversion/AMDGPUToROCDL/wmma.mlir

This file was added.

				// RUN: mlir-opt %s -convert-amdgpu-to-rocdl=chipset=gfx1100 --allow-unregistered-dialect | FileCheck %s
				func.func @mfma_to_rocdl(%arg0 : vector<16xf16>, %arg1 : vector<8xf32>, %arg2 : vector<4xf32>,
				%arg3 : vector<16xbf16>, %arg4 : vector<8xf16>, %arg5 : vector<8xbf16>,
				%arg6 : vector<16xi8>, %arg7 : vector<4xi32>, %arg8 : vector<8xi32>,
				%arg9 : vector<16xui8>){
				// CHECK: rocdl.wmma.f32.16x16x16.f16{{.*}}: (vector<16xf16>, vector<16xf16>, vector<8xf32>) -> vector<8xf32>
				amdgpu.wmma %arg0 * %arg0 + %arg1 : vector<16xf16>, vector<16xf16>, vector<8xf32>
				// CHECK: rocdl.wmma.f32.16x16x16.f16{{.*}}: (vector<16xf16>, vector<16xf16>, vector<4xf32>) -> vector<4xf32>
				amdgpu.wmma %arg0 * %arg0 + %arg2 : vector<16xf16>, vector<16xf16>, vector<4xf32>
				// CHECK: rocdl.wmma.f32.16x16x16.bf16{{.*}}: (vector<16xbf16>, vector<16xbf16>, vector<8xf32>) -> vector<8xf32>
				amdgpu.wmma %arg3 * %arg3 + %arg1 : vector<16xbf16>, vector<16xbf16>, vector<8xf32>
				// CHECK: rocdl.wmma.f32.16x16x16.bf16{{.*}}: (vector<16xbf16>, vector<16xbf16>, vector<4xf32>) -> vector<4xf32>
				amdgpu.wmma %arg3 * %arg3 + %arg2 : vector<16xbf16>, vector<16xbf16>, vector<4xf32>
				// CHECK: rocdl.wmma.f16.16x16x16.f16{{.*}}: (vector<16xf16>, vector<16xf16>, vector<16xf16>, i1) -> vector<16xf16>
				amdgpu.wmma %arg0 * %arg0 + %arg0 {subwordOffset = 1 : i32}: vector<16xf16>, vector<16xf16>, vector<16xf16>
				// CHECK: rocdl.wmma.f16.16x16x16.f16{{.*}}: (vector<16xf16>, vector<16xf16>, vector<8xf16>, i1) -> vector<8xf16>
				amdgpu.wmma %arg0 * %arg0 + %arg4 {subwordOffset = 0 : i32}: vector<16xf16>, vector<16xf16>, vector<8xf16>
				// CHECK: rocdl.wmma.bf16.16x16x16.bf16{{.*}}: (vector<16xbf16>, vector<16xbf16>, vector<16xbf16>, i1) -> vector<16xbf16>
				amdgpu.wmma %arg3 * %arg3 + %arg3 {subwordOffset = 1 : i32}: vector<16xbf16>, vector<16xbf16>, vector<16xbf16>
				// CHECK: rocdl.wmma.bf16.16x16x16.bf16{{.*}}: (vector<16xbf16>, vector<16xbf16>, vector<8xbf16>, i1) -> vector<8xbf16>
				amdgpu.wmma %arg3 * %arg3 + %arg5 {subwordOffset = 0 : i32}: vector<16xbf16>, vector<16xbf16>, vector<8xbf16>
				// CHECK: rocdl.wmma.i32.16x16x16.iu8{{.*}}: (i1, vector<4xi32>, i1, vector<4xi32>, vector<4xi32>, i1) -> vector<4xi32>
				amdgpu.wmma %arg6 * %arg6 + %arg7 {clamp}: vector<16xi8>, vector<16xi8>, vector<4xi32>
				// CHECK: rocdl.wmma.i32.16x16x16.iu8{{.*}}: (i1, vector<4xi32>, i1, vector<4xi32>, vector<8xi32>, i1) -> vector<8xi32>
				amdgpu.wmma %arg9 * %arg9 + %arg8 {unsignedA, unsignedB, clamp}: vector<16xui8>, vector<16xui8>, vector<8xi32>
				func.return
				}

mlir/test/Dialect/AMDGPU/invalid.mlir

	[97 lines not shown]

	func.func @no_negation(%a: f32, %b: f32, %c: vector<32xf32>) -> vector<32xf32> {			func.func @no_negation(%a: f32, %b: f32, %c: vector<32xf32>) -> vector<32xf32> {
	// expected-error@+1 {{'amdgpu.mfma' op negation flags only available for double-precision operations}}			// expected-error@+1 {{'amdgpu.mfma' op negation flags only available for double-precision operations}}
	%d = amdgpu.mfma %a * %b + %c {			%d = amdgpu.mfma %a * %b + %c {
	m = 32 : i32, n = 32 : i32, k = 1 : i32, blocks = 2 : i32,			m = 32 : i32, n = 32 : i32, k = 1 : i32, blocks = 2 : i32,
	abid = 0 : i32, cbsz = 0 : i32, negateA} blgp = none : f32, f32, vector<32xf32>			abid = 0 : i32, cbsz = 0 : i32, negateA} blgp = none : f32, f32, vector<32xf32>
	func.return %d : vector<32xf32>			func.return %d : vector<32xf32>
	}			}

				// -----

				func.func @wmma(%arg0 : vector<16xf16>, %arg1 : vector<8xi32>) -> vector<8xi32> {
				// expected-error@+1 {{'amdgpu.wmma' op Expected int sources with int destination}}
				%0 = amdgpu.wmma %arg0 * %arg0 + %arg1 : vector<16xf16>, vector<16xf16>, vector<8xi32>
				func.return %0 : vector<8xi32>
				}

mlir/test/Dialect/AMDGPU/ops.mlir

	[88 lines not shown]
	}			}

	// CHECK-LABEL: func @mfma			// CHECK-LABEL: func @mfma
	func.func @mfma(%arg0 : f32, %arg1 : vector<32xf32>) -> vector<32xf32> {			func.func @mfma(%arg0 : f32, %arg1 : vector<32xf32>) -> vector<32xf32> {
	// CHECK: amdgpu.mfma			// CHECK: amdgpu.mfma
	%0 = amdgpu.mfma %arg0 * %arg0 + %arg1 { abid = 1 : i32, cbsz = 1 : i32, k = 1 : i32, m = 32 : i32, n = 32 : i32, blocks = 2 : i32 } blgp = bcast_second_32 : f32, f32, vector<32xf32>			%0 = amdgpu.mfma %arg0 * %arg0 + %arg1 { abid = 1 : i32, cbsz = 1 : i32, k = 1 : i32, m = 32 : i32, n = 32 : i32, blocks = 2 : i32 } blgp = bcast_second_32 : f32, f32, vector<32xf32>
	func.return %0 : vector<32xf32>			func.return %0 : vector<32xf32>
	}			}

				// CHECK-LABEL: func @wmma
				func.func @wmma(%arg0 : vector<16xf16>, %arg1 : vector<8xf16>) -> vector<8xf16> {
				// CHECK: amdgpu.wmma
				%0 = amdgpu.wmma %arg0 * %arg0 + %arg1 : vector<16xf16>, vector<16xf16>, vector<8xf16>
				func.return %0 : vector<8xf16>
				}

mlir/test/Target/LLVMIR/rocdl.mlir

[209 lines not shown]
llvm.func @rocdl.xdlops(%arg0 : f32, %arg1 : f32,

// CHECK: call <16 x float> @llvm.amdgcn.mfma.f32.32x32x16.bf8.bf8(i64 %{{.*}}, i64 %{{.*}}, <16 x float> %{{.*}}, i32 {{.*}}, i32 {{.*}}, i32 {{.*}})		// CHECK: call <16 x float> @llvm.amdgcn.mfma.f32.32x32x16.bf8.bf8(i64 %{{.*}}, i64 %{{.*}}, <16 x float> %{{.*}}, i32 {{.*}}, i32 {{.*}}, i32 {{.*}})
%r27 = rocdl.mfma.f32.32x32x16.bf8.bf8 %arg11, %arg11, %arg4, %csti32, %csti32, %csti32 :		%r27 = rocdl.mfma.f32.32x32x16.bf8.bf8 %arg11, %arg11, %arg4, %csti32, %csti32, %csti32 :
(i64, i64, vector<16xf32>,		(i64, i64, vector<16xf32>,
i32, i32, i32) -> vector<16xf32>		i32, i32, i32) -> vector<16xf32>
llvm.return %r0 : vector<32 x f32>		llvm.return %r0 : vector<32 x f32>
}		}

		llvm.func @rocdl.wmma(%arg0 : vector<8xf32>, %arg1 : vector<16 x f16>, %arg2 : vector<16 x i16>, %arg3 : vector<8 x i32>,
		%arg4 : vector<2xi32>, %arg5 : vector<4xi32>, %arg6 : vector<4xf32>, %arg7 : vector<8xf16>, %arg8 : vector<8xi16>) -> vector<8xf32> {
		%zero = llvm.mlir.constant(false) : i1

		// ---- Wave32 -----

		// f16 -> f32
		// CHECK: call <8 x float> @llvm.amdgcn.wmma.f32.16x16x16.f16.v8f32(<16 x half> %{{.*}}, <16 x half> %{{.*}}, <8 x float> %{{.*}})
		%r0 = rocdl.wmma.f32.16x16x16.f16 %arg1, %arg1, %arg0 : (vector<16xf16>, vector<16xf16>, vector<8xf32>) -> vector<8xf32>

		// bf16 -> f32
		// CHECK: call <8 x float> @llvm.amdgcn.wmma.f32.16x16x16.bf16.v8f32(<16 x i16> %{{.*}}, <16 x i16> %{{.*}}, <8 x float> %{{.*}})
		%r1 = rocdl.wmma.f32.16x16x16.bf16 %arg2, %arg2, %arg0 : (vector<16xi16>, vector<16xi16>, vector<8xf32>) -> vector<8xf32>

		// f16 -> f16 (OPSEL = {0,1})
		// CHECK: call <16 x half> @llvm.amdgcn.wmma.f16.16x16x16.f16.v16f16(<16 x half> %{{.*}}, <16 x half> %{{.*}}, <16 x half> %{{.*}}, i1 {{.*}})
		%r2 = rocdl.wmma.f16.16x16x16.f16 %arg1, %arg1, %arg1, %zero : (vector<16xf16>, vector<16xf16>, vector<16xf16>, i1) -> vector<16xf16>

		// bf16 -> bf16 (OPSEL = {0,1})
		// CHECK: call <16 x i16> @llvm.amdgcn.wmma.bf16.16x16x16.bf16.v16i16(<16 x i16> %{{.*}}, <16 x i16> %{{.*}}, <16 x i16> %{{.*}}, i1 {{.*}})
		%r4 = rocdl.wmma.bf16.16x16x16.bf16 %arg2, %arg2, %arg2, %zero : (vector<16xi16>, vector<16xi16>, vector<16xi16>, i1) -> vector<16xi16>

		// int8 -> int32 (signA = {0,1}, signB = {0,1}, clamp = {0,1})
		// CHECK: call <8 x i32> @llvm.amdgcn.wmma.i32.16x16x16.iu8.v8i32(i1 {{.*}}, <4 x i32> %{{.*}}, i1 {{.*}}, <4 x i32> %{{.*}}, <8 x i32> %{{.*}}, i1 {{.*}})
		%r5 = rocdl.wmma.i32.16x16x16.iu8 %zero, %arg5, %zero, %arg5, %arg3, %zero : (i1, vector<4xi32>, i1, vector<4xi32>, vector<8xi32>, i1) -> vector<8xi32>

		// int4 -> int32 (signA = {0,1}, signB = {0,1}, clamp = {0,1})
		// CHECK: call <8 x i32> @llvm.amdgcn.wmma.i32.16x16x16.iu4.v8i32(i1 {{.*}}, <2 x i32> %{{.*}}, i1 {{.*}}, <2 x i32> %{{.*}}, <8 x i32> %{{.*}}, i1 {{.*}})
		%r6 = rocdl.wmma.i32.16x16x16.iu4 %zero, %arg4, %zero, %arg4, %arg3, %zero : (i1, vector<2xi32>, i1, vector<2xi32>, vector<8xi32>, i1) -> vector<8xi32>

		// ---- Wave64 -----

		// f16 -> f32
		// CHECK: call <4 x float> @llvm.amdgcn.wmma.f32.16x16x16.f16.v4f32(<16 x half> %{{.*}}, <16 x half> %{{.*}}, <4 x float> %{{.*}})
		%r7 = rocdl.wmma.f32.16x16x16.f16 %arg1, %arg1, %arg6 : (vector<16xf16>, vector<16xf16>, vector<4xf32>) -> vector<4xf32>

		// bf16 -> f32
		// CHECK: call <4 x float> @llvm.amdgcn.wmma.f32.16x16x16.bf16.v4f32(<16 x i16> %{{.*}}, <16 x i16> %{{.*}}, <4 x float> %{{.*}})
		%r8 = rocdl.wmma.f32.16x16x16.bf16 %arg2, %arg2, %arg6 : (vector<16xi16>, vector<16xi16>, vector<4xf32>) -> vector<4xf32>

		// f16 -> f16 (OPSEL = {0,1})
		// CHECK: call <8 x half> @llvm.amdgcn.wmma.f16.16x16x16.f16.v8f16(<16 x half> %{{.*}}, <16 x half> %{{.*}}, <8 x half> %{{.*}}, i1 {{.*}})
		%r9 = rocdl.wmma.f16.16x16x16.f16 %arg1, %arg1, %arg7, %zero : (vector<16xf16>, vector<16xf16>, vector<8xf16>, i1) -> vector<8xf16>

		// bf16 -> bf16 (OPSEL = {0,1})
		// CHECK: call <8 x i16> @llvm.amdgcn.wmma.bf16.16x16x16.bf16.v8i16(<16 x i16> %{{.*}}, <16 x i16> %{{.*}}, <8 x i16> %{{.*}}, i1 {{.*}})
		%r11 = rocdl.wmma.bf16.16x16x16.bf16 %arg2, %arg2, %arg8, %zero : (vector<16xi16>, vector<16xi16>, vector<8xi16>, i1) -> vector<8xi16>

		// int8 -> int32 (signA = {0,1}, signB = {0,1}, clamp = {0,1})
		// CHECK: call <4 x i32> @llvm.amdgcn.wmma.i32.16x16x16.iu8.v4i32(i1 {{.*}}, <4 x i32> %{{.*}}, i1 {{.*}}, <4 x i32> %{{.*}}, <4 x i32> %{{.*}}, i1 {{.*}})
		%r12 = rocdl.wmma.i32.16x16x16.iu8 %zero, %arg5, %zero, %arg5, %arg5, %zero : (i1, vector<4xi32>, i1, vector<4xi32>, vector<4xi32>, i1) -> vector<4xi32>

		// int4 -> int32 (signA = {0,1}, signB = {0,1}, clamp = {0,1})
		// CHECK: call <4 x i32> @llvm.amdgcn.wmma.i32.16x16x16.iu4.v4i32(i1 {{.*}}, <2 x i32> %{{.*}}, i1 {{.*}}, <2 x i32> %{{.*}}, <4 x i32> %{{.*}}, i1 {{.*}})
		%r13 = rocdl.wmma.i32.16x16x16.iu4 %zero, %arg4, %zero, %arg4, %arg5, %zero : (i1, vector<2xi32>, i1, vector<2xi32>, vector<4xi32>, i1) -> vector<4xi32>

		llvm.return %r0 : vector<8xf32>
		}


llvm.func @rocdl.mubuf(%rsrc : vector<4xi32>, %vindex : i32,
%offset : i32, %vdata1 : vector<1xf32>,
%vdata2 : vector<2xf32>, %vdata4 : vector<4xf32>) {
%glc = llvm.mlir.constant(false) : i1
%slc = llvm.mlir.constant(true) : i1
// CHECK-LABEL: rocdl.mubuf
// CHECK: call <1 x float> @llvm.amdgcn.buffer.load.v1f32(<4 x i32> %{{.*}}, i32 %{{.*}}, i32 %{{.*}}, i1 {{.*}}, i1 {{.*}})
// CHECK: call <2 x float> @llvm.amdgcn.buffer.load.v2f32(<4 x i32> %{{.*}}, i32 %{{.*}}, i32 %{{.*}}, i1 {{.*}}, i1 {{.*}})

Revision Contents

Diff 542611

mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPU.td

mlir/include/mlir/Dialect/LLVMIR/ROCDLOps.td

mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp

mlir/lib/Dialect/AMDGPU/IR/AMDGPUDialect.cpp

mlir/test/Conversion/AMDGPUToROCDL/wmma.mlir

mlir/test/Dialect/AMDGPU/invalid.mlir

mlir/test/Dialect/AMDGPU/ops.mlir

mlir/test/Target/LLVMIR/rocdl.mlir
