This is an archive of the discontinued LLVM Phabricator instance.

mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp
448	Is it ok to move all inline asm (the emit* functions, including `emitCpAsyncOpZfillAsm`) into its own file, `NVGPUToNVASM.[h/cpp]`? Let me know how you feel about it. I can think of a few more inline asm snippets that we will need before they become available in the NVVM backend. We can just edit the NVGPUToNVASM.* files as we add more asm and prune them as it becomes available through NVVM.

ThomasRaoux added inline comments. Nov 3 2022, 10:53 AM

mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp
448	I would prefer not to create a file just to have inline assembly. It doesn't sound like a very natural separation of the code to me.

manishucsd accepted this revision. Nov 3 2022, 12:08 PM

christopherbate marked 2 inline comments as done. Nov 7 2022, 8:42 AM

Closed by commit rG708185f03ff4: [mlir][NVGPU] Add support for structured sparsity MMA variants (authored by christopherbate). · Nov 7 2022, 8:43 AM

This revision was automatically updated to reflect the committed changes.

christopherbate added a commit: rG708185f03ff4: [mlir][NVGPU] Add support for structured sparsity MMA variants.

Revision Contents

| Path | Size |
| --- | --- |
| mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td | 78 lines |
| mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp | 249 lines |
| mlir/lib/Dialect/NVGPU/IR/NVGPUDialect.cpp | 108 lines |
| mlir/test/Conversion/NVGPUToNVVM/nvgpu-to-nvvm.mlir | 116 lines |
| mlir/test/Dialect/NVGPU/roundtrip.mlir | 38 lines |

Diff 473692

mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td

Show First 20 Lines • Show All 92 Lines • ▼ Show 20 Lines	def NVGPU_LdMatrixOp : NVGPU_Op<"ldmatrix", [
let results = (outs AnyVector:$res);		let results = (outs AnyVector:$res);
let assemblyFormat = [{		let assemblyFormat = [{
$srcMemref`[` $indices `]` attr-dict `:` type($srcMemref) `->` type($res)		$srcMemref`[` $indices `]` attr-dict `:` type($srcMemref) `->` type($res)
}];		}];

let hasVerifier = 1;		let hasVerifier = 1;
}		}

def NVGPU_MmaSyncOp : NVGPU_Op<"mma.sync", [		class NVGPU_MmaSyncOp<string mnemonic> :
Pure,		NVGPU_Op<mnemonic, [Pure,
PredOpTrait<"matrixA and matrixB have same element type",		PredOpTrait<"matrixA and matrixB have same element type",
TCopVTEtIsSameAs<0, 1>>]> {		TCopVTEtIsSameAs<0, 1>>]> {
		code extraBaseClassDeclaration = [{
		std::array<int64_t, 3> getMmaShapeAsArray() {
		ArrayAttr mmaShape = this->getMmaShape();
		assert(mmaShape.size() == 3 && "mmaShape should be three integers");
		return {mmaShape[0].cast<IntegerAttr>().getInt(),
		mmaShape[1].cast<IntegerAttr>().getInt(),
		mmaShape[2].cast<IntegerAttr>().getInt()};
		}
		}];

		let hasVerifier = 1;
		}

		def NVGPU_MmaSyncOp : NVGPU_MmaSyncOp<"mma.sync"> {
let description = [{		let description = [{
The `nvgpu.mma.sync` op represents the warp-level matrix-multiply-and-		The `nvgpu.mma.sync` op represents the warp-level matrix-multiply-and-
accumulate (mma) operation that is compatible with `nvvm.mma.sync`.		accumulate (mma) operation that is compatible with `nvvm.mma.sync`.
The operands and results vector sizes are thread-level ownership to		The operands and results vector sizes are thread-level ownership to
the warp-level mma operation shape. `mmaShape` attribute holds the		the warp-level mma operation shape. `mmaShape` attribute holds the
warp-level matrix-multiply shape.		warp-level matrix-multiply shape.

The `nvgpu.mma.sync` op serves as an intermediate point between lowering from		The `nvgpu.mma.sync` op serves as an intermediate point between lowering from
Show All 25 Lines	OpBuilder<(ins "Value":$matrixA,
"ArrayAttr":$mmaShape)>		"ArrayAttr":$mmaShape)>
];		];

let assemblyFormat = [{		let assemblyFormat = [{
`(` $matrixA`,` $matrixB`,` $matrixC `)` attr-dict		`(` $matrixA`,` $matrixB`,` $matrixC `)` attr-dict
`:` `(` type($matrixA) `,` type($matrixB) `,` type($matrixC) `)` `->` type($res)		`:` `(` type($matrixA) `,` type($matrixB) `,` type($matrixC) `)` `->` type($res)
}];		}];

let hasVerifier = 1;		let extraClassDeclaration = extraBaseClassDeclaration;
}		}

		def NVGPU_MmaSparseSyncMetadataType : FixedVectorOfLengthAndType<[2], [I16]>,
		BuildableType<"::mlir::VectorType::get("
		"{2},$_builder.getI16Type())">;

		def NVGPU_MmaSparseSyncOp : NVGPU_MmaSyncOp<"mma.sp.sync"> {
		let description = [{
		The `nvgpu.mma.sp.sync` operation performs a warp-distributed MMA operation
		where operand A is "structured sparse". In this case, the `matrixA` operand
		represents the (warp-distributed) non-zero values of operand A, and the
		`sparse_metadata` operand provides the indices.

		The full description of the sparsity storage format and distribution scheme is
		described in the PTX docs. This operation is meant to follow the semantics
		described in the PTX documentation here:
		https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-for-sparse-mma

		The way the indices are distributed among the threads in a warp is controlled
		by the optional `sparsity_selector` operand, which is `0` by default. For
		more information, please consult the PTX documentation linked above.

		Example (targeting the f16 16x8x32 `mma.sp` PTX instruction):

		```mlir
		nvgpu.mma.sp.sync (%a, %b, %c) metadata (%meta) {mmaShape = [16, 8, 32]} :
		(vector<4x2xf16>, vector<2x2xf16>, vector<2x2xf16>) -> vector<2x2xf16>
		```
		}];

		let arguments = (ins AnyVector:$matrixA,
		AnyVector:$matrixB,
		AnyVector:$matrixC,
		NVGPU_MmaSparseSyncMetadataType:$sparseMetadata,
		I64ArrayAttr:$mmaShape,
		DefaultValuedAttr<I32Attr, "0">:$sparsitySelector,
		OptionalAttr<UnitAttr>:$tf32Enabled
		);

		let results = (outs AnyVector:$res);

		let builders = [
		OpBuilder<(ins "Value":$matrixA,
		"Value":$matrixB,
		"Value":$matrixC,
		"Value":$sparseMetadata,
		"ArrayRef<int64_t>":$mmaShape)>
		];

		let assemblyFormat = [{
		`(` $matrixA`,` $matrixB`,` $matrixC `)` `metadata` `(` $sparseMetadata `)` attr-dict
		`:` `(` type($matrixA) `,` type($matrixB) `,` type($matrixC) `)` `->` type($res)
		}];

		let extraClassDeclaration = extraBaseClassDeclaration;
		}

def NVGPU_DeviceAsyncCopyOp : NVGPU_Op<"device_async_copy", [		def NVGPU_DeviceAsyncCopyOp : NVGPU_Op<"device_async_copy", [
AttrSizedOperandSegments]> {		AttrSizedOperandSegments]> {
let summary = "device-side asynchronous copy";		let summary = "device-side asynchronous copy";
let description = [{		let description = [{
The `nvgpu.device_async_copy` op initiates an asynchronous copy operation of		The `nvgpu.device_async_copy` op initiates an asynchronous copy operation of
elements from source (global memory) to the destination (shared memory)		elements from source (global memory) to the destination (shared memory)
without blocking the thread. The async copy is added to a group.		without blocking the thread. The async copy is added to a group.
▲ Show 20 Lines • Show All 109 Lines • Show Last 20 Lines
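
The warp-wide size checks that this revision's verifier applies to `nvgpu.mma.sync` and `nvgpu.mma.sp.sync` can be sketched outside of MLIR. The following is a minimal standalone sketch, not code from the revision: `PerThreadCounts` and `perThreadCounts` are hypothetical names, and it assumes only what the verifier shows (32 threads per warp, and operand A holding half of its `m * k` elements in sparse mode).

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of the revision): per-thread element counts
// implied by the verifier's warp-wide size checks.
struct PerThreadCounts {
  int64_t a, b, c;
};

constexpr int64_t kThreads = 32; // threads per warp

PerThreadCounts perThreadCounts(const std::array<int64_t, 3> &mmaShape,
                                bool sparse) {
  auto [m, n, k] = mmaShape;
  // In sparse mode, operand A stores only half of its m * k elements.
  const int64_t sparseFactor = sparse ? 2 : 1;
  // Each operand must divide evenly across the threads of the warp.
  assert((m * k / sparseFactor) % kThreads == 0);
  assert((k * n) % kThreads == 0);
  assert((m * n) % kThreads == 0);
  return {m * k / sparseFactor / kThreads, k * n / kThreads, m * n / kThreads};
}
```

Under these assumptions, the sparse 16x8x32 shape gives 8 A elements per thread, which for f16 matches the `vector<4x2xf16>` type of `matrixA` in the op description's example.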

mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp

//===- NVGPUToNVVM.cpp - NVGPU to NVVM dialect conversion -----------------===//		//===- NVGPUToNVVM.cpp - NVGPU to NVVM dialect conversion -----------------===//
//		//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.		// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.		// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception		// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#include "mlir/Conversion/NVGPUToNVVM/NVGPUToNVVM.h"		#include "mlir/Conversion/NVGPUToNVVM/NVGPUToNVVM.h"

#include "mlir/Conversion/LLVMCommon/ConversionTarget.h"		#include "mlir/Conversion/LLVMCommon/ConversionTarget.h"
#include "mlir/Conversion/LLVMCommon/Pattern.h"		#include "mlir/Conversion/LLVMCommon/Pattern.h"
#include "mlir/Dialect/GPU/IR/GPUDialect.h"		#include "mlir/Dialect/GPU/IR/GPUDialect.h"
		#include "mlir/Dialect/LLVMIR/LLVMDialect.h"
#include "mlir/Dialect/LLVMIR/NVVMDialect.h"		#include "mlir/Dialect/LLVMIR/NVVMDialect.h"
#include "mlir/Dialect/NVGPU/IR/NVGPUDialect.h"		#include "mlir/Dialect/NVGPU/IR/NVGPUDialect.h"
		#include "mlir/IR/TypeUtilities.h"
#include "mlir/Pass/Pass.h"		#include "mlir/Pass/Pass.h"

namespace mlir {		namespace mlir {
#define GEN_PASS_DEF_CONVERTNVGPUTONVVM		#define GEN_PASS_DEF_CONVERTNVGPUTONVVM
#include "mlir/Conversion/Passes.h.inc"		#include "mlir/Conversion/Passes.h.inc"
} // namespace mlir		} // namespace mlir

using namespace mlir;		using namespace mlir;
▲ Show 20 Lines • Show All 224 Lines • ▼ Show 20 Lines	for (int64_t i = 0, e = vectorResultType.getDimSize(0); i < e; i++) {
result = rewriter.create<LLVM::InsertValueOp>(loc, result, casted, i);		result = rewriter.create<LLVM::InsertValueOp>(loc, result, casted, i);
}		}

rewriter.replaceOp(op, result);		rewriter.replaceOp(op, result);
return success();		return success();
}		}
};		};

		/// Convert the given type into the corresponding PTX type (NVVM::MMATypes
		/// enum).
		static FailureOr<NVVM::MMATypes> getNvvmMmaType(Type t) {
		Type elType = getElementTypeOrSelf(t);
		if (elType.isInteger(8))
		return NVVM::MMATypes::s8;
		if (elType.isInteger(4))
		return NVVM::MMATypes::s4;
		if (elType.isF16())
		return NVVM::MMATypes::f16;
		if (elType.isF64())
		return NVVM::MMATypes::f64;
		if (elType.isF32())
		return NVVM::MMATypes::tf32;
		return failure();
		}

struct MmaSyncOptoNVVM : public ConvertOpToLLVMPattern<nvgpu::MmaSyncOp> {		struct MmaSyncOptoNVVM : public ConvertOpToLLVMPattern<nvgpu::MmaSyncOp> {
using ConvertOpToLLVMPattern<nvgpu::MmaSyncOp>::ConvertOpToLLVMPattern;		using ConvertOpToLLVMPattern<nvgpu::MmaSyncOp>::ConvertOpToLLVMPattern;

LogicalResult		LogicalResult
matchAndRewrite(nvgpu::MmaSyncOp op, OpAdaptor adaptor,		matchAndRewrite(nvgpu::MmaSyncOp op, OpAdaptor adaptor,
ConversionPatternRewriter &rewriter) const override {		ConversionPatternRewriter &rewriter) const override {
Location loc = op->getLoc();		Location loc = op->getLoc();
// Get the shapes of the MMAMatrix type being used. The shapes will		// Get the shapes of the MMAMatrix type being used. The shapes will
// choose which intrinsic this op will be lowered to.		// choose which intrinsic this op will be lowered to.
auto aType = op.getMatrixA().getType().cast<VectorType>();		VectorType aType = op.getMatrixA().getType();
auto cType = op.getMatrixC().getType().cast<VectorType>();		VectorType bType = op.getMatrixB().getType();
		VectorType cType = op.getMatrixC().getType();

		std::array<int64_t, 3> gemmShape = op.getMmaShapeAsArray();

int64_t m = op.getMmaShape()[0].cast<IntegerAttr>().getInt();		// Tensor Cores (mma.sync) on F32 works only with TensorFloat32 (TF32).
int64_t n = op.getMmaShape()[1].cast<IntegerAttr>().getInt();		bool tf32Enabled = op->hasAttr(op.getTf32EnabledAttrName());
int64_t k = op.getMmaShape()[2].cast<IntegerAttr>().getInt();		if (aType.getElementType().isF32() && !tf32Enabled)
std::array<int64_t, 3> gemmShape{m, n, k};		return failure();

NVVM::MMATypes ptxTypeA;		FailureOr<NVVM::MMATypes> ptxTypeA = getNvvmMmaType(aType);
NVVM::MMATypes ptxTypeB;		if (failed(ptxTypeA))
		return op->emitOpError("failed to deduce operand PTX types");
		FailureOr<NVVM::MMATypes> ptxTypeB = getNvvmMmaType(bType);
		if (failed(ptxTypeB))
		return op->emitOpError("failed to deduce operand PTX types");
Optional<NVVM::MMATypes> ptxTypeC = NVVM::MmaOp::inferOperandMMAType(		Optional<NVVM::MMATypes> ptxTypeC = NVVM::MmaOp::inferOperandMMAType(
cType.getElementType(), /*isAccumulator=*/true);		cType.getElementType(), /*isAccumulator=*/true);
if (!ptxTypeC)		if (!ptxTypeC)
return op->emitError(		return op->emitError(
"could not infer the PTX type for the accumulator/result");		"could not infer the PTX type for the accumulator/result");

// Tensor Cores (mma.sync) on F32 works only with TensorFloat32 (TF32).		// TODO: add an attribute to the op to customize this behavior.
bool tf32Enabled = op->hasAttr(op.getTf32EnabledAttrName());
if (aType.getElementType().isF32() && !tf32Enabled)
return failure();

Optional<NVVM::MMAIntOverflow> overflow(llvm::None);		Optional<NVVM::MMAIntOverflow> overflow(llvm::None);
if (aType.getElementType().isInteger(8)) {		if (aType.getElementType().isa<IntegerType>())
ptxTypeA = NVVM::MMATypes::s8;
ptxTypeB = NVVM::MMATypes::s8;
overflow = NVVM::MMAIntOverflow::satfinite;
} else if (aType.getElementType().isInteger(4)) {
ptxTypeA = NVVM::MMATypes::s4;
ptxTypeB = NVVM::MMATypes::s4;
overflow = NVVM::MMAIntOverflow::satfinite;		overflow = NVVM::MMAIntOverflow::satfinite;
} else if (aType.getElementType().isF16()) {
ptxTypeA = NVVM::MMATypes::f16;
ptxTypeB = NVVM::MMATypes::f16;
} else if (aType.getElementType().isF64()) {
ptxTypeA = NVVM::MMATypes::f64;
ptxTypeB = NVVM::MMATypes::f64;
} else if (aType.getElementType().isF32()) {
ptxTypeA = NVVM::MMATypes::tf32;
ptxTypeB = NVVM::MMATypes::tf32;
} else {
return op->emitError("could not deduce operand PTX types");
}

SmallVector<Value> matA =		SmallVector<Value> matA =
unpackOperandVector(rewriter, loc, adaptor.getMatrixA(), ptxTypeA);		unpackOperandVector(rewriter, loc, adaptor.getMatrixA(), *ptxTypeA);
SmallVector<Value> matB =		SmallVector<Value> matB =
unpackOperandVector(rewriter, loc, adaptor.getMatrixB(), ptxTypeB);		unpackOperandVector(rewriter, loc, adaptor.getMatrixB(), *ptxTypeB);
SmallVector<Value> matC =		SmallVector<Value> matC =
unpackOperandVector(rewriter, loc, adaptor.getMatrixC(), *ptxTypeC);		unpackOperandVector(rewriter, loc, adaptor.getMatrixC(), *ptxTypeC);

Type desiredRetTy = typeConverter->convertType(op->getResultTypes()[0]);		Type desiredRetTy = typeConverter->convertType(op->getResultTypes()[0]);
Type intrinsicResTy = inferIntrinsicResultType(		Type intrinsicResTy = inferIntrinsicResultType(
typeConverter->convertType(op->getResultTypes()[0]));		typeConverter->convertType(op->getResultTypes()[0]));
Value intrinsicResult = rewriter.create<NVVM::MmaOp>(		Value intrinsicResult = rewriter.create<NVVM::MmaOp>(
op.getLoc(), intrinsicResTy, matA, matB, matC,		op.getLoc(), intrinsicResTy, matA, matB, matC,
/*shape=*/gemmShape,		/*shape=*/gemmShape,
/*b1Op=*/llvm::None,		/*b1Op=*/llvm::None,
/*intOverflow=*/overflow,		/*intOverflow=*/overflow,
/*multiplicandPtxTypes=*/		/*multiplicandPtxTypes=*/
std::array<NVVM::MMATypes, 2>{ptxTypeA, ptxTypeB},		std::array<NVVM::MMATypes, 2>{*ptxTypeA, *ptxTypeB},
/*multiplicandLayouts=*/		/*multiplicandLayouts=*/
std::array<NVVM::MMALayout, 2>{NVVM::MMALayout::row,		std::array<NVVM::MMALayout, 2>{NVVM::MMALayout::row,
NVVM::MMALayout::col});		NVVM::MMALayout::col});
rewriter.replaceOp(op, convertIntrinsicResult(op.getLoc(), intrinsicResTy,		rewriter.replaceOp(op, convertIntrinsicResult(op.getLoc(), intrinsicResTy,
desiredRetTy, intrinsicResult,		desiredRetTy, intrinsicResult,
rewriter));		rewriter));
return success();		return success();
}		}
Show All 38 Lines	static void emitCpAsyncOpZfillAsm(Location loc, Value dstPtr, Value srcPtr,
Value srcElementsI32 =		Value srcElementsI32 =
rewriter.create<LLVM::TruncOp>(loc, rewriter.getI32Type(), srcElements);		rewriter.create<LLVM::TruncOp>(loc, rewriter.getI32Type(), srcElements);
Value srcBytes = rewriter.create<LLVM::LShrOp>(		Value srcBytes = rewriter.create<LLVM::LShrOp>(
loc, rewriter.create<LLVM::MulOp>(loc, bitwidth, srcElementsI32), c3I32);		loc, rewriter.create<LLVM::MulOp>(loc, bitwidth, srcElementsI32), c3I32);

SmallVector<Value> asmVals{dstPtr, srcPtr, dstBytes, srcBytes};		SmallVector<Value> asmVals{dstPtr, srcPtr, dstBytes, srcBytes};

rewriter.create<LLVM::InlineAsmOp>(		rewriter.create<LLVM::InlineAsmOp>(
loc, LLVM::LLVMVoidType::get(rewriter.getContext()), /*operands=*/asmVals,		loc, LLVM::LLVMVoidType::get(rewriter.getContext()),
		/*operands=*/asmVals,
/*asm_string=*/asmStr,		/*asm_string=*/asmStr,
/*constraints=*/asmConstraints, /*has_side_effects=*/true,		/*constraints=*/asmConstraints, /*has_side_effects=*/true,
/*is_align_stack=*/false, /*asm_dialect=*/asmDialectAttr,		/*is_align_stack=*/false, /*asm_dialect=*/asmDialectAttr,
/*operand_attrs=*/ArrayAttr());		/*operand_attrs=*/ArrayAttr());
}		}

		/// Returns the constraints for the sparse MMA inline assembly instruction.
		static std::string buildMmaSparseAsmConstraintString(unsigned matASize,
		unsigned matBSize,
		unsigned matCSize) {
		std::string str;
		llvm::raw_string_ostream ss(str);
		for (unsigned i = 0; i < matCSize; i++)
		ss << "=r,";
		for (unsigned i = 0; i < matASize + matBSize + matCSize; i++)
		ss << "r,";
		// The final two operands are for the sparsity metadata and sparsity selector.
		ss << "r,r";
		ss.flush();
		return str;
		}

		/// Returns the string for the `mma.sp.sync` instruction that corresponds to
		/// the given parameters. Note that this function doesn't do any validation,
		/// it's expected that the provided parameters correspond to a valid
		/// instruction.
		static std::string
		buildMmaSparseAsmString(const std::array<int64_t, 3> &shape, unsigned matASize,
		unsigned matBSize, unsigned matCSize,
		NVVM::MMATypes ptxTypeA, NVVM::MMATypes ptxTypeB,
		NVVM::MMATypes ptxTypeC, NVVM::MMATypes ptxTypeD,
		Optional<NVVM::MMAIntOverflow> overflow) {
		auto ptxTypeStr = [](NVVM::MMATypes ptxType) {
		return NVVM::stringifyMMATypes(ptxType);
		};

		std::string asmStr;
		llvm::raw_string_ostream ss(asmStr);
		ss << "mma.sp.sync.aligned.m" << shape[0] << "n" << shape[1] << "k"
		<< shape[2] << ".row.col.";

		if (overflow)
		ss << NVVM::stringifyMMAIntOverflow(*overflow) << ".";

		ss << ptxTypeStr(ptxTypeD) << "." << ptxTypeStr(ptxTypeA) << "."
		<< ptxTypeStr(ptxTypeB) << "." << ptxTypeStr(ptxTypeC) << " ";
		unsigned asmArgIdx = 0;

		// The operand string is structured into sections `{matC elements...},
		// {matA elements...}, {matB elements...}, {matC elements}`.
		for (const auto arrSize : {matCSize, matASize, matBSize, matCSize}) {
		ss << "{";
		for (unsigned i = 0; i < arrSize; i++)
		ss << "$" << asmArgIdx++ << (i < arrSize - 1 ? "," : "");
		ss << "},";
		}
		ss << "$" << asmArgIdx++ << ",$" << asmArgIdx++ << ";";
		ss.flush();
		return asmStr;
		}

		/// Builds an inline assembly operation corresponding to the specified MMA
		/// sparse sync operation.
		static FailureOr<LLVM::InlineAsmOp> emitMmaSparseSyncOpAsm(
		Location loc, NVVM::MMATypes ptxTypeA, NVVM::MMATypes ptxTypeB,
		NVVM::MMATypes ptxTypeC, NVVM::MMATypes ptxTypeD,
		Optional<NVVM::MMAIntOverflow> overflow, ArrayRef<Value> unpackedAData,
		ArrayRef<Value> unpackedB, ArrayRef<Value> unpackedC, Value indexData,
		int64_t metadataSelector, const std::array<int64_t, 3> &shape,
		Type intrinsicResultType, ConversionPatternRewriter &rewriter) {
		auto asmDialectAttr = LLVM::AsmDialectAttr::get(rewriter.getContext(),
		LLVM::AsmDialect::AD_ATT);

		std::string asmStr = buildMmaSparseAsmString(
		shape, unpackedAData.size(), unpackedB.size(), unpackedC.size(), ptxTypeA,
		ptxTypeB, ptxTypeC, ptxTypeD, overflow);
		std::string constraintStr = buildMmaSparseAsmConstraintString(
		unpackedAData.size(), unpackedB.size(), unpackedC.size());

		Value selectorVal = rewriter.create<LLVM::ConstantOp>(
		loc, rewriter.getI32Type(), rewriter.getI32IntegerAttr(metadataSelector));

		SmallVector<Value> asmVals;
		asmVals.reserve(unpackedAData.size() + unpackedB.size() + unpackedC.size() +
		2);
		for (ArrayRef<Value> args : {unpackedAData, unpackedB, unpackedC})
		llvm::append_range(asmVals, args);
		asmVals.push_back(indexData);
		asmVals.push_back(selectorVal);

		return rewriter.create<LLVM::InlineAsmOp>(loc,
		/*resultTypes=*/intrinsicResultType,
		/*operands=*/asmVals,
		/*asm_string=*/asmStr,
		/*constraints=*/constraintStr,
		/*has_side_effects=*/true,
		/*is_align_stack=*/false,
		/*asm_dialect=*/asmDialectAttr,
		/*operand_attrs=*/ArrayAttr());
		}

		/// Lowers `nvgpu.mma.sp.sync` to inline assembly.
		struct NVGPUMmaSparseSyncLowering
		: public ConvertOpToLLVMPattern<nvgpu::MmaSparseSyncOp> {
		using ConvertOpToLLVMPattern<nvgpu::MmaSparseSyncOp>::ConvertOpToLLVMPattern;

		LogicalResult
		matchAndRewrite(nvgpu::MmaSparseSyncOp op, OpAdaptor adaptor,
		ConversionPatternRewriter &rewriter) const override {
		Location loc = op->getLoc();
		// Get the shapes of the MMAMatrix type being used. The shapes will
		// choose which intrinsic this op will be lowered to.
		VectorType aType = op.getMatrixA().getType();
		VectorType bType = op.getMatrixB().getType();
		VectorType cType = op.getMatrixC().getType();

		FailureOr<NVVM::MMATypes> ptxTypeA = getNvvmMmaType(aType);
		if (failed(ptxTypeA))
		return op->emitOpError("failed to deduce operand PTX types");
		FailureOr<NVVM::MMATypes> ptxTypeB = getNvvmMmaType(bType);
		if (failed(ptxTypeB))
		return op->emitOpError("failed to deduce operand PTX types");
		Optional<NVVM::MMATypes> ptxTypeC = NVVM::MmaOp::inferOperandMMAType(
		cType.getElementType(), /*isAccumulator=*/true);
		if (!ptxTypeC)
		return op->emitError(
		"could not infer the PTX type for the accumulator/result");

		// Same as `mma.sync`, F32 works only with TensorFloat32 (TF32).
		bool tf32Enabled = op->hasAttr(op.getTf32EnabledAttrName());
		if (aType.getElementType().isF32() && !tf32Enabled)
		return failure();

		// TODO: add an attribute to the op to customize this behavior.
		Optional<NVVM::MMAIntOverflow> overflow(llvm::None);
		if (aType.getElementType().isa<IntegerType>())
		overflow = NVVM::MMAIntOverflow::satfinite;

		SmallVector<Value> matA =
		unpackOperandVector(rewriter, loc, adaptor.getMatrixA(), *ptxTypeA);
		SmallVector<Value> matB =
		unpackOperandVector(rewriter, loc, adaptor.getMatrixB(), *ptxTypeB);
		SmallVector<Value> matC =
		unpackOperandVector(rewriter, loc, adaptor.getMatrixC(), *ptxTypeC);

		Type desiredRetTy = typeConverter->convertType(op->getResultTypes()[0]);
		Type intrinsicResTy = inferIntrinsicResultType(
		typeConverter->convertType(op->getResultTypes()[0]));

		// Bitcast the sparse metadata from vector<2xi16> to an i32.
		Value sparseMetadata = adaptor.getSparseMetadata();
		if (sparseMetadata.getType() !=
		LLVM::getFixedVectorType(rewriter.getI16Type(), 2))
		return op->emitOpError() << "Expected metadata type to be LLVM "
		"VectorType of 2 i16 elements";
		sparseMetadata = rewriter.create<LLVM::BitcastOp>(
		loc, rewriter.getI32Type(), sparseMetadata);

		FailureOr<LLVM::InlineAsmOp> intrinsicResult = emitMmaSparseSyncOpAsm(
		loc, *ptxTypeA, *ptxTypeB, *ptxTypeC, *ptxTypeC, overflow, matA, matB,
		matC, sparseMetadata, op.getSparsitySelector(), op.getMmaShapeAsArray(),
		intrinsicResTy, rewriter);
		if (failed(intrinsicResult))
		return failure();

		assert((*intrinsicResult).getNumResults() == 1 &&
		"expected inline asm op returns a single LLVM struct type");
		rewriter.replaceOp(
		op, convertIntrinsicResult(op.getLoc(), intrinsicResTy, desiredRetTy,
		(*intrinsicResult)->getResult(0), rewriter));
		return success();
		}
		};

struct NVGPUAsyncCopyLowering		struct NVGPUAsyncCopyLowering
: public ConvertOpToLLVMPattern<nvgpu::DeviceAsyncCopyOp> {		: public ConvertOpToLLVMPattern<nvgpu::DeviceAsyncCopyOp> {
using ConvertOpToLLVMPattern<		using ConvertOpToLLVMPattern<
nvgpu::DeviceAsyncCopyOp>::ConvertOpToLLVMPattern;		nvgpu::DeviceAsyncCopyOp>::ConvertOpToLLVMPattern;

LogicalResult		LogicalResult
matchAndRewrite(nvgpu::DeviceAsyncCopyOp op, OpAdaptor adaptor,		matchAndRewrite(nvgpu::DeviceAsyncCopyOp op, OpAdaptor adaptor,
ConversionPatternRewriter &rewriter) const override {		ConversionPatternRewriter &rewriter) const override {
▲ Show 20 Lines • Show All 89 Lines • ▼ Show 20 Lines	struct NVGPUAsyncWaitLowering
}		}
};		};

} // namespace		} // namespace

void mlir::populateNVGPUToNVVMConversionPatterns(LLVMTypeConverter &converter,		void mlir::populateNVGPUToNVVMConversionPatterns(LLVMTypeConverter &converter,
RewritePatternSet &patterns) {		RewritePatternSet &patterns) {
patterns.add<MmaSyncOptoNVVM, MmaLdMatrixOpToNVVM, NVGPUAsyncCopyLowering,		patterns.add<MmaSyncOptoNVVM, MmaLdMatrixOpToNVVM, NVGPUAsyncCopyLowering,
NVGPUAsyncCreateGroupLowering, NVGPUAsyncWaitLowering>(		NVGPUAsyncCreateGroupLowering, NVGPUAsyncWaitLowering,
converter);		NVGPUMmaSparseSyncLowering>(converter);
}		}

std::unique_ptr<Pass> mlir::createConvertNVGPUToNVVMPass() {		std::unique_ptr<Pass> mlir::createConvertNVGPUToNVVMPass() {
return std::make_unique<ConvertNVGPUToNVVMPass>();		return std::make_unique<ConvertNVGPUToNVVMPass>();
}		}
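
To see what the two string builders above produce, here is a minimal standalone sketch of the same construction using `std::ostringstream`. It is an approximation rather than the revision's code: `sparseMmaConstraints` and `sparseMmaAsm` are hypothetical names, and the PTX type names and register counts are passed in directly instead of being derived from `NVVM::MMATypes` and the unpacked operands.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <initializer_list>
#include <sstream>
#include <string>

// Hypothetical stand-in for buildMmaSparseAsmConstraintString.
std::string sparseMmaConstraints(unsigned matASize, unsigned matBSize,
                                 unsigned matCSize) {
  std::ostringstream ss;
  for (unsigned i = 0; i < matCSize; ++i)
    ss << "=r,"; // one output register per result element
  for (unsigned i = 0; i < matASize + matBSize + matCSize; ++i)
    ss << "r,"; // input registers for A, B, and C
  ss << "r,r"; // sparsity metadata and sparsity selector
  return ss.str();
}

// Hypothetical stand-in for buildMmaSparseAsmString; type names are strings.
std::string sparseMmaAsm(const std::array<int64_t, 3> &shape, unsigned matASize,
                         unsigned matBSize, unsigned matCSize,
                         const std::string &typeA, const std::string &typeB,
                         const std::string &typeC, const std::string &typeD,
                         const std::string &overflow = "") {
  assert(matASize && matBSize && matCSize &&
         "operand groups must be non-empty");
  std::ostringstream ss;
  ss << "mma.sp.sync.aligned.m" << shape[0] << "n" << shape[1] << "k"
     << shape[2] << ".row.col.";
  if (!overflow.empty())
    ss << overflow << ".";
  ss << typeD << "." << typeA << "." << typeB << "." << typeC << " ";

  unsigned argIdx = 0;
  // Operand groups: {results}, {A}, {B}, {C inputs}.
  for (unsigned groupSize : {matCSize, matASize, matBSize, matCSize}) {
    ss << "{";
    for (unsigned i = 0; i < groupSize; ++i)
      ss << "$" << argIdx++ << (i + 1 < groupSize ? "," : "");
    ss << "},";
  }
  // Increment in separate statements so the numbering does not depend on
  // evaluation order within a single expression.
  ss << "$" << argIdx++;
  ss << ",$" << argIdx++ << ";";
  return ss.str();
}
```

For example, assuming an f16 m16n8k16 operation whose A, B, and C operands each unpack to two registers, `sparseMmaAsm({16, 8, 16}, 2, 2, 2, "f16", "f16", "f16", "f16")` yields `mma.sp.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 {$0,$1},{$2,$3},{$4,$5},{$6,$7},$8,$9;`, and `sparseMmaConstraints(2, 2, 2)` yields `=r,=r,r,r,r,r,r,r,r,r`: two outputs followed by eight inputs, matching operands `$0` through `$9`.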

mlir/lib/Dialect/NVGPU/IR/NVGPUDialect.cpp

	//===- NVGPUDialect.cpp - MLIR NVGPU ops implementation -------------------===//			//===- NVGPUDialect.cpp - MLIR NVGPU ops implementation -------------------===//
	//			//
	// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.			// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
	// See https://llvm.org/LICENSE.txt for license information.			// See https://llvm.org/LICENSE.txt for license information.
	// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception			// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	//			//
	// This file implements the NVGPU dialect and its operations.			// This file implements the NVGPU dialect and its operations.
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	#include "mlir/Dialect/NVGPU/IR/NVGPUDialect.h"			#include "mlir/Dialect/NVGPU/IR/NVGPUDialect.h"
	#include "mlir/Dialect/GPU/IR/GPUDialect.h"			#include "mlir/Dialect/GPU/IR/GPUDialect.h"
	#include "mlir/IR/Builders.h"			#include "mlir/IR/Builders.h"
				#include "mlir/IR/BuiltinAttributes.h"
	#include "mlir/IR/DialectImplementation.h"			#include "mlir/IR/DialectImplementation.h"
	#include "mlir/IR/OpImplementation.h"			#include "mlir/IR/OpImplementation.h"
	#include "mlir/IR/TypeUtilities.h"			#include "mlir/IR/TypeUtilities.h"
				#include "mlir/IR/Verifier.h"
	#include "llvm/ADT/TypeSwitch.h"			#include "llvm/ADT/TypeSwitch.h"

	using namespace mlir;			using namespace mlir;
	using namespace mlir::nvgpu;			using namespace mlir::nvgpu;

	void nvgpu::NVGPUDialect::initialize() {			void nvgpu::NVGPUDialect::initialize() {
	addTypes<			addTypes<
	#define GET_TYPEDEF_LIST			#define GET_TYPEDEF_LIST
	▲ Show 20 Lines • Show All 48 Lines • ▼ Show 20 Lines
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	void MmaSyncOp::build(::mlir::OpBuilder &odsBuilder,			void MmaSyncOp::build(::mlir::OpBuilder &odsBuilder,
	::mlir::OperationState &odsState, Value matrixA,			::mlir::OperationState &odsState, Value matrixA,
	Value matrixB, Value matrixC, ArrayAttr mmaShape) {			Value matrixB, Value matrixC, ArrayAttr mmaShape) {
	build(odsBuilder, odsState, matrixC.getType(), matrixA, matrixB, matrixC,			build(odsBuilder, odsState, matrixC.getType(), matrixA, matrixB, matrixC,
	mmaShape, UnitAttr());			mmaShape, UnitAttr());
	}			}

	LogicalResult MmaSyncOp::verify() {			/// Performs verification for MmaSyncOp and MmaSparseSyncOp.
				static LogicalResult verifyMmaSyncOp(Operation *op,
	// Fundamental tensor core mma.sync op			TypedValue<VectorType> matrixA,
	// For F32 (TF32), F16, S8, and S4 data types fundamental tensor core			TypedValue<VectorType> matrixB,
	// operation is of shape: 8-by-8-by-128b. F64 is an exception. The			TypedValue<VectorType> matrixC,
	// verification for mma.sync covering various shapes and data types is based			const std::array<int64_t, 3> &mmaShape,
	// on the fundamental tensor core operionation.			bool tf32Enabled, bool sparse = false) {

				// The verification for mma.sync covering various shapes and data types is
				// based on the fundamental tensor core shape.

				// "Fundamental" tensor core shapes:
				// - For F32 (TF32), F16, S8, and S4 data
				// types the fundamental tensor core operation is of shape 8-by-8-by-128b.
				// - F64 is an exception and is of shape 8-by-8-by-256b.
	constexpr int kThreads = 32; // 32 threads per warp			constexpr int kThreads = 32; // 32 threads per warp
	int64_t shapeM = 8;			int64_t shapeM = 8;
	int64_t shapeN = 8;			int64_t shapeN = 8;
	int64_t shapeK; // set based on data type (128b for all data types except F64)			int64_t shapeK; // set based on data type (128b for all data types except F64)

	// Number of elements A, B, and C per thread per fundamental tensor core tile			// Number of elements A, B, and C per thread per fundamental tensor core tile
	int64_t numElementA; // set based on data type (32b except F64)			int64_t numElementA; // set based on data type (32b except F64)
	int64_t numElementB; // set based on data type (32b except F64)			int64_t numElementB; // set based on data type (32b except F64)
	int64_t numElementC{2}; // two accumulator elements per fundamental tile			int64_t numElementC{2}; // two accumulator elements per fundamental tile

	// nvgpu.mma.sync vector operands (per thread)			// nvgpu.mma.sync vector operands (per thread)
	auto aVector = getMatrixA().getType().cast<VectorType>();			auto aVector = matrixA.getType();
	auto bVector = getMatrixB().getType().cast<VectorType>();			auto bVector = matrixB.getType();
	auto cVector = getMatrixC().getType().cast<VectorType>();			auto cVector = matrixC.getType();

	// vector shapes			// vector shapes
	ArrayRef<int64_t> aShape = aVector.getShape();			ArrayRef<int64_t> aShape = aVector.getShape();
	ArrayRef<int64_t> bShape = bVector.getShape();			ArrayRef<int64_t> bShape = bVector.getShape();
	ArrayRef<int64_t> cShape = cVector.getShape();			ArrayRef<int64_t> cShape = cVector.getShape();

	// vector element type			// vector element type
	Type aType = aVector.getElementType();			Type aType = aVector.getElementType();

	// tensor float32 (TF32) enabled			// Certain data types are not allowed in sparse mode.
	bool tf32Enabled = getOperation()->hasAttr(getTf32EnabledAttrName());			if (sparse && aType.isF64())
				return op->emitError() << "f64 is not supported for sparse mode";
	// nvgpu.mma.sync shape (per 32 threads or per warp)
	int64_t m = getMmaShape()[0].cast<IntegerAttr>().getInt();
	int64_t n = getMmaShape()[1].cast<IntegerAttr>().getInt();
	int64_t k = getMmaShape()[2].cast<IntegerAttr>().getInt();

	if (aType.isF64()) {			if (aType.isF64()) {
	// exception to 8-by-8-128b fundamental tensor core tile size			// exception to 8-by-8-128b fundamental tensor core tile size
	shapeK = 4;			shapeK = 4;
	numElementA = 1;			numElementA = 1;
	numElementB = 1;			numElementB = 1;
	} else if (aType.isF32() || aType.isBF16() || aType.isF16() ||			} else if (aType.isF32() || aType.isBF16() || aType.isF16() ||
	aType.isInteger(8) || aType.isInteger(4)) {			aType.isInteger(8) || aType.isInteger(4)) {
	// 8-by-8-128b fundamental tensor core tile size			// 8-by-8-128b fundamental tensor core tile size
	int operandBitwidth = aType.getIntOrFloatBitWidth();			int operandBitwidth = aType.getIntOrFloatBitWidth();
	shapeK = 128 / operandBitwidth; // 128b wide shapeK			shapeK = 128 / operandBitwidth; // 128b wide shapeK

	numElementA = 32 / operandBitwidth; // 32b wide operand A			numElementA = 32 / operandBitwidth; // 32b wide operand A
	numElementB = 32 / operandBitwidth; // 32b wide operand B			numElementB = 32 / operandBitwidth; // 32b wide operand B
	} else {			} else {
	return emitError() << "expected input data type (i4,i8,f16,bf16,tf32,f64) "			return op->emitError()
	"supported by nvgpu.mma.sync";			<< "expected input data type (i4,i8,f16,bf16,tf32,f64) "
				"supported by "
				<< op->getName();
	}			}

	//			//
	// Basic verification			// Basic verification
	//			//

				auto [m, n, k] = mmaShape;

	// verify warp-wide size for vector a			// verify warp-wide size for vector a
	if (aShape[0] * aShape[1] * kThreads != m * k)			int64_t sparseFactor = sparse ? 2 : 1;
	return emitOpError() << "expected " << m * k			if (aShape[0] * aShape[1] * kThreads != m * k / sparseFactor)
	<< " warp-wide matrix A elements";			return op->emitOpError()
				<< "expected " << m * k << " warp-wide matrix A elements";

	// verify warp-wide size for vector b			// verify warp-wide size for vector b
	if (bShape[0] * bShape[1] * kThreads != k * n)			if (bShape[0] * bShape[1] * kThreads != k * n)
	return emitOpError() << "expected " << k * n			return op->emitOpError()
	<< " warp-wide matrix B elements";			<< "expected " << k * n << " warp-wide matrix B elements";

	// verify warp-wide size for vector c			// verify warp-wide size for vector c
	if (cShape[0] * cShape[1] * kThreads != m * n)			if (cShape[0] * cShape[1] * kThreads != m * n)
	return emitOpError() << "expected " << m * n			return op->emitOpError()
	<< " warp-wide matrix C elements";			<< "expected " << m * n << " warp-wide matrix C elements";

	// verify tf32 tensor cores are enabled for only F32 datatype			// verify tf32 tensor cores are enabled for only F32 datatype
	if (tf32Enabled && !(aType.isF32()))			if (tf32Enabled && !(aType.isF32()))
	return emitOpError() << "expected tf32 tensor cores only for F32 operands";			return op->emitOpError()
				<< "expected tf32 tensor cores only for F32 operands";

	//			//
	// Extended verification			// Extended verification
	//			//

	// tiles of fundamental tensor core operations			// tiles of fundamental tensor core operations
	int64_t mTile = m / shapeM;			int64_t mTile = m / shapeM;
	int64_t nTile = n / shapeN;			int64_t nTile = n / shapeN;
	int64_t kTile = k / shapeK;			int64_t kTile = k / shapeK;

	// verify shape of aVector			// verify shape of aVector
	if ((aShape[0] != mTile * kTile) || (aShape[1] != numElementA))			if ((aShape[0] != mTile * kTile / (sparse ? 2 : 1)) ||
	return emitOpError() << "expected matrix A to be shaped (" << mTile * kTile			(aShape[1] != numElementA))
	<< " x " << numElementA << ")";			return op->emitOpError() << "expected matrix A to be shaped ("
				<< mTile * kTile << " x " << numElementA << ")";

	// verify shape of bVector			// verify shape of bVector
	if ((bShape[0] != kTile * nTile) || (bShape[1] != numElementB))			if ((bShape[0] != kTile * nTile) || (bShape[1] != numElementB))
	return emitOpError() << "expected matrix B to be shaped (" << kTile * nTile			return op->emitOpError() << "expected matrix B to be shaped ("
	<< " x " << numElementB << ")";			<< kTile * nTile << " x " << numElementB << ")";

	// verify shape of cVector			// verify shape of cVector
	if ((cShape[0] != mTile * nTile) || (cShape[1] != numElementC))			if ((cShape[0] != mTile * nTile) || (cShape[1] != numElementC))
	return emitOpError() << "expected matrix C to be shaped (" << mTile * nTile			return op->emitOpError() << "expected matrix C to be shaped ("
	<< " x " << numElementC << ")";			<< mTile * nTile << " x " << numElementC << ")";

	return success();			return success();
	}			}

				LogicalResult MmaSyncOp::verify() {
				return verifyMmaSyncOp(this->getOperation(), getMatrixA(), getMatrixB(),
				getMatrixC(), getMmaShapeAsArray(),
				getOperation()->hasAttr(getTf32EnabledAttrName()));
				}

				//===----------------------------------------------------------------------===//
				// NVGPU_MmaSparseSyncOp
				//===----------------------------------------------------------------------===//
				void MmaSparseSyncOp::build(::mlir::OpBuilder &odsBuilder,
				::mlir::OperationState &odsState, Value matrixA,
				Value matrixB, Value matrixC, Value sparseMetadata,
				ArrayRef<int64_t> mmaShape) {
				build(odsBuilder, odsState, matrixC.getType(), matrixA, matrixB, matrixC,
				sparseMetadata, odsBuilder.getI64ArrayAttr(mmaShape), 0, UnitAttr());
				}

				LogicalResult MmaSparseSyncOp::verify() {
				return verifyMmaSyncOp(this->getOperation(), getMatrixA(), getMatrixB(),
				getMatrixC(), getMmaShapeAsArray(),
				getOperation()->hasAttr(getTf32EnabledAttrName()),
				true);
				}
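
The shape arithmetic in the verifier above can be sketched as a standalone check. This is a minimal sketch, not the code in NVGPUDialect.cpp: `VecShape` and `checkMmaShapes` are hypothetical names, the 8x8 fundamental tile in M/N and the 32-thread warp are assumed from the surrounding file, the f64 exception is not modelled, and the returned messages are illustrative only.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical stand-in for the shape of a 2-D per-thread vector operand.
struct VecShape {
  int64_t rows;
  int64_t cols;
};

// Assumed constants: 32 threads per warp, 8x8 fundamental tile in M/N,
// and two accumulator elements per fundamental tile.
constexpr int64_t kThreads = 32;
constexpr int64_t kShapeM = 8;
constexpr int64_t kShapeN = 8;
constexpr int64_t kNumElementC = 2;

// Returns std::nullopt when the shapes are consistent, otherwise a short
// illustrative message. Only the 8-by-8-128b tile branch is modelled.
std::optional<std::string> checkMmaShapes(VecShape a, VecShape b, VecShape c,
                                          int64_t m, int64_t n, int64_t k,
                                          int64_t operandBitwidth,
                                          bool sparse) {
  assert(operandBitwidth > 0 && 32 % operandBitwidth == 0);
  const int64_t shapeK = 128 / operandBitwidth;      // 128b wide K
  const int64_t numElementA = 32 / operandBitwidth;  // 32b wide operand A
  const int64_t numElementB = 32 / operandBitwidth;  // 32b wide operand B
  const int64_t sparseFactor = sparse ? 2 : 1;       // sparse A holds half of K

  // Basic verification: warp-wide element counts.
  if (a.rows * a.cols * kThreads != m * k / sparseFactor)
    return std::string("unexpected warp-wide matrix A element count");
  if (b.rows * b.cols * kThreads != k * n)
    return std::string("unexpected warp-wide matrix B element count");
  if (c.rows * c.cols * kThreads != m * n)
    return std::string("unexpected warp-wide matrix C element count");

  // Extended verification: per-thread vector shapes in fundamental tiles.
  const int64_t mTile = m / kShapeM;
  const int64_t nTile = n / kShapeN;
  const int64_t kTile = k / shapeK;
  if (a.rows != mTile * kTile / sparseFactor || a.cols != numElementA)
    return std::string("unexpected matrix A shape");
  if (b.rows != kTile * nTile || b.cols != numElementB)
    return std::string("unexpected matrix B shape");
  if (c.rows != mTile * nTile || c.cols != kNumElementC)
    return std::string("unexpected matrix C shape");
  return std::nullopt;
}
```

For the sparse f16 m16n8k32 case in the tests, A is `vector<4x2xf16>`: 4 * 2 * 32 = 256 warp-wide elements, which equals 16 * 32 / 2, so the halved A operand passes while B keeps the full k * n count.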

	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	// NVGPU_LdMatrixOp			// NVGPU_LdMatrixOp
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	LogicalResult LdMatrixOp::verify() {			LogicalResult LdMatrixOp::verify() {

	// ldmatrix reads data from source in shared memory			// ldmatrix reads data from source in shared memory
	auto srcMemref = getSrcMemref().getType().cast<MemRefType>();			auto srcMemref = getSrcMemref().getType().cast<MemRefType>();


mlir/test/Conversion/NVGPUToNVVM/nvgpu-to-nvvm.mlir

func.func @async_cp_zfill(
%0 = nvgpu.device_async_copy %src[%i, %i], %dst[%i, %i, %i], 4, %srcElements {bypassL1}: memref<128x128xf32> to memref<3x16x128xf32, 3>		%0 = nvgpu.device_async_copy %src[%i, %i], %dst[%i, %i, %i], 4, %srcElements {bypassL1}: memref<128x128xf32> to memref<3x16x128xf32, 3>
// CHECK: nvvm.cp.async.commit.group		// CHECK: nvvm.cp.async.commit.group
%1 = nvgpu.device_async_create_group %0		%1 = nvgpu.device_async_create_group %0
// CHECK: nvvm.cp.async.wait.group 1		// CHECK: nvvm.cp.async.wait.group 1
nvgpu.device_async_wait %1 { numGroups = 1 : i32 }		nvgpu.device_async_wait %1 { numGroups = 1 : i32 }

return		return
}		}

		// -----

		// CHECK-LABEL: func @mma_sp_sync_f16_16832(
		func.func @mma_sp_sync_f16_16832(%arg0: vector<4x2xf16>,
		%arg1: vector<4x2xf16>,
		%arg2: vector<2x2xf16>,
		%arg3: vector<2xi16>) -> vector<2x2xf16> {
		// CHECK: llvm.extractvalue %{{.*}}[0] : !llvm.array<4 x vector<2xf16>>
		// CHECK: llvm.extractvalue %{{.*}}[1] : !llvm.array<4 x vector<2xf16>>
		// CHECK: llvm.extractvalue %{{.*}}[2] : !llvm.array<4 x vector<2xf16>>
		// CHECK: llvm.extractvalue %{{.*}}[3] : !llvm.array<4 x vector<2xf16>>

		// CHECK: llvm.extractvalue %{{.*}}[0] : !llvm.array<4 x vector<2xf16>>
		// CHECK: llvm.extractvalue %{{.*}}[1] : !llvm.array<4 x vector<2xf16>>
		// CHECK: llvm.extractvalue %{{.*}}[2] : !llvm.array<4 x vector<2xf16>>
		// CHECK: llvm.extractvalue %{{.*}}[3] : !llvm.array<4 x vector<2xf16>>

		// CHECK: llvm.extractvalue %{{.*}}[0] : !llvm.array<2 x vector<2xf16>>
		// CHECK: llvm.extractvalue %{{.*}}[1] : !llvm.array<2 x vector<2xf16>>

		// CHECK-NOT llvm.extractvalue

		// CHECK: %[[sparseMetadata:.+]] = llvm.bitcast %{{.+}} : vector<2xi16> to i32
		// CHECK: %[[sparseSelector:.+]] = llvm.mlir.constant(0 : i32) : i32

		// CHECK: %[[d:.+]] = llvm.inline_asm has_side_effects asm_dialect = att
		// CHECK-SAME: "mma.sp.sync.aligned.m16n8k32.row.col.f16.f16.f16.f16 {$0,$1},{$2,$3,$4,$5},{$6,$7,$8,$9},{$10,$11},$12,$13;"
		// CHECK-SAME: "=r,=r,r,r,r,r,r,r,r,r,r,r,r,r"
		// CHECK-SAME: %[[sparseMetadata]], %[[sparseSelector]] :
		// CHECK-SAME: -> !llvm.struct<(vector<2xf16>, vector<2xf16>)>

		%d = nvgpu.mma.sp.sync(%arg0, %arg1, %arg2) metadata(%arg3) {mmaShape = [16, 8, 32]} :
		(vector<4x2xf16>, vector<4x2xf16>, vector<2x2xf16>) -> vector<2x2xf16>

		// CHECK-DAG: llvm.extractvalue %[[d]][0] : !llvm.struct<(vector<2xf16>, vector<2xf16>)>
		// CHECK-DAG: llvm.extractvalue %[[d]][1] : !llvm.struct<(vector<2xf16>, vector<2xf16>)>
		// CHECK: llvm.mlir.undef : !llvm.array<2 x vector<2xf16>>
		// CHECK: llvm.insertvalue %{{.+}}, %{{.+}}[0] : !llvm.array<2 x vector<2xf16>>
		// CHECK: llvm.insertvalue %{{.+}}, %{{.+}}[1] : !llvm.array<2 x vector<2xf16>>
		return %d : vector<2x2xf16>
		}

		// -----

		// CHECK-LABEL: func @mma_sp_sync_f16_16816(
		func.func @mma_sp_sync_f16_16816(%arg0: vector<2x2xf16>,
		%arg1: vector<2x2xf16>,
		%arg2: vector<2x2xf16>,
		%arg3: vector<2xi16>) -> vector<2x2xf16> {

		// CHECK: llvm.extractvalue %{{.*}}[0] : !llvm.array<2 x vector<2xf16>>
		// CHECK: llvm.extractvalue %{{.*}}[1] : !llvm.array<2 x vector<2xf16>>

		// CHECK: llvm.extractvalue %{{.*}}[0] : !llvm.array<2 x vector<2xf16>>
		// CHECK: llvm.extractvalue %{{.*}}[1] : !llvm.array<2 x vector<2xf16>>

		// CHECK: llvm.extractvalue %{{.*}}[0] : !llvm.array<2 x vector<2xf16>>
		// CHECK: llvm.extractvalue %{{.*}}[1] : !llvm.array<2 x vector<2xf16>>

		// CHECK-NOT llvm.extractvalue

		// CHECK: %[[sparseMetadata:.+]] = llvm.bitcast %{{.+}} : vector<2xi16> to i32
		// CHECK: %[[sparseSelector:.+]] = llvm.mlir.constant(0 : i32) : i32

		// CHECK: %[[d:.+]] = llvm.inline_asm has_side_effects asm_dialect = att
		// CHECK-SAME: "mma.sp.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 {$0,$1},{$2,$3},{$4,$5},{$6,$7},$8,$9;"
		// CHECK-SAME: "=r,=r,r,r,r,r,r,r,r,r"
		// CHECK-SAME: %[[sparseMetadata]], %[[sparseSelector]] :
		// CHECK-SAME: -> !llvm.struct<(vector<2xf16>, vector<2xf16>)>

		%d = nvgpu.mma.sp.sync(%arg0, %arg1, %arg2) metadata(%arg3) {mmaShape = [16, 8, 16]} :
		(vector<2x2xf16>, vector<2x2xf16>, vector<2x2xf16>) -> vector<2x2xf16>
		return %d : vector<2x2xf16>
		}

		// -----

		// CHECK-LABEL: func @mma_sp_sync_i8_16864(
		func.func @mma_sp_sync_i8_16864(%arg0: vector<4x4xi8>,
		%arg1: vector<4x4xi8>,
		%arg2: vector<2x2xi32>,
		%arg3: vector<2xi16>) -> vector<2x2xi32> {

		// CHECK: llvm.extractvalue %{{.*}}[0] : !llvm.array<4 x vector<4xi8>>
		// CHECK: llvm.bitcast %{{.+}} : vector<4xi8> to i32
		// CHECK: llvm.extractvalue %{{.*}}[1] : !llvm.array<4 x vector<4xi8>>
		// CHECK: llvm.bitcast %{{.+}} : vector<4xi8> to i32
		// CHECK: llvm.extractvalue %{{.*}}[2] : !llvm.array<4 x vector<4xi8>>
		// CHECK: llvm.bitcast %{{.+}} : vector<4xi8> to i32
		// CHECK: llvm.extractvalue %{{.*}}[3] : !llvm.array<4 x vector<4xi8>>


		// CHECK: llvm.extractvalue %{{.*}}[0] : !llvm.array<4 x vector<4xi8>>
		// CHECK: llvm.bitcast %{{.+}} : vector<4xi8> to i32
		// CHECK: llvm.extractvalue %{{.*}}[1] : !llvm.array<4 x vector<4xi8>>
		// CHECK: llvm.bitcast %{{.+}} : vector<4xi8> to i32

		// CHECK: llvm.extractvalue %{{.*}}[{{.*}}] : !llvm.array<2 x vector<2xi32>>
		// CHECK: llvm.extractvalue %{{.*}}[{{.*}}] : !llvm.array<2 x vector<2xi32>>

		// CHECK-NOT llvm.extractvalue

		// CHECK: %[[sparseMetadata:.+]] = llvm.bitcast %{{.+}} : vector<2xi16> to i32
		// CHECK: %[[sparseSelector:.+]] = llvm.mlir.constant(0 : i32) : i32

		// CHECK: %[[d:.+]] = llvm.inline_asm has_side_effects asm_dialect = att
		// CHECK-SAME: "mma.sp.sync.aligned.m16n8k64.row.col.satfinite.s32.s8.s8.s32
		// CHECK-SAME: "=r,=r,=r,=r,r,r,r,r,r,r,r,r,r,r,r,r,r,r"
		// CHECK-SAME: %[[sparseMetadata]], %[[sparseSelector]] :
		// CHECK-SAME: -> !llvm.struct<(i32, i32, i32, i32)

		%d = nvgpu.mma.sp.sync(%arg0, %arg1, %arg2) metadata(%arg3) {mmaShape = [16, 8, 64]} :
		(vector<4x4xi8>, vector<4x4xi8>, vector<2x2xi32>) -> vector<2x2xi32>
		return %d : vector<2x2xi32>
		}
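
The operand and constraint strings matched by the CHECK-SAME lines above follow a regular pattern: one brace group each for D, A, B and C, then the sparse metadata and the sparsity selector as two trailing scalar operands, with `=r` for every output register and `r` for every input. A minimal sketch of that pattern follows; `buildOperandList` and `buildConstraints` are hypothetical helpers written for illustration and are not taken from NVGPUToNVVM.cpp.

```cpp
#include <cassert>
#include <string>

// Builds the placeholder list "{$0,..},{..},{..},{..},$m,$s;" for numD output
// registers and numA/numB/numC input registers, followed by the sparse
// metadata and sparsity selector operands.
std::string buildOperandList(int numD, int numA, int numB, int numC) {
  assert(numD > 0 && numA > 0 && numB > 0 && numC > 0);
  std::string out;
  int next = 0;  // index of the next "$N" placeholder
  auto group = [&](int count) {
    out += "{";
    for (int i = 0; i < count; ++i) {
      if (i != 0)
        out += ",";
      out += "$" + std::to_string(next++);
    }
    out += "}";
  };
  group(numD);
  out += ",";
  group(numA);
  out += ",";
  group(numB);
  out += ",";
  group(numC);
  out += ",$" + std::to_string(next++);  // sparse metadata
  out += ",$" + std::to_string(next++);  // sparsity selector
  out += ";";
  return out;
}

// Builds the constraint string: "=r" per output, then "r" per input.
std::string buildConstraints(int numOutputs, int numInputs) {
  std::string out;
  for (int i = 0; i < numOutputs + numInputs; ++i) {
    if (i != 0)
      out += ",";
    out += (i < numOutputs) ? "=r" : "r";
  }
  return out;
}
```

For the m16n8k32 f16 test, `buildOperandList(2, 4, 4, 2)` yields the `{$0,$1},{$2,$3,$4,$5},{$6,$7,$8,$9},{$10,$11},$12,$13;` tail of the asm string, and `buildConstraints(2, 12)` yields two `=r` entries followed by twelve `r` entries (4 for A, 4 for B, 2 for C, plus metadata and selector).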

mlir/test/Dialect/NVGPU/roundtrip.mlir

func.func @mma_sync(%arg0: vector<4x2xf16>,
%arg1: vector<2x2xf16>,		%arg1: vector<2x2xf16>,
%arg2: vector<2x2xf16>) -> vector<2x2xf16> {		%arg2: vector<2x2xf16>) -> vector<2x2xf16> {
// CHECK: nvgpu.mma.sync(%{{.*}}, %{{.*}}, %{{.*}}) {mmaShape = [16, 8, 16]} : (vector<4x2xf16>, vector<2x2xf16>, vector<2x2xf16>) -> vector<2x2xf16>		// CHECK: nvgpu.mma.sync(%{{.*}}, %{{.*}}, %{{.*}}) {mmaShape = [16, 8, 16]} : (vector<4x2xf16>, vector<2x2xf16>, vector<2x2xf16>) -> vector<2x2xf16>
%d = nvgpu.mma.sync(%arg0, %arg1, %arg2) {mmaShape = [16, 8, 16]} :		%d = nvgpu.mma.sync(%arg0, %arg1, %arg2) {mmaShape = [16, 8, 16]} :
(vector<4x2xf16>, vector<2x2xf16>, vector<2x2xf16>) -> vector<2x2xf16>		(vector<4x2xf16>, vector<2x2xf16>, vector<2x2xf16>) -> vector<2x2xf16>
return %d : vector<2x2xf16>		return %d : vector<2x2xf16>
}		}

		// CHECK-LABEL: func @mma_sp_sync_f16_16832(
		func.func @mma_sp_sync_f16_16832(%arg0: vector<4x2xf16>,
		%arg1: vector<4x2xf16>,
		%arg2: vector<2x2xf16>,
		%arg3: vector<2xi16>) -> vector<2x2xf16> {
		// CHECK: nvgpu.mma.sp.sync(%{{.*}}, %{{.*}}, %{{.*}}) metadata(%{{.+}}) {
		// CHECK-SAME: mmaShape = [16, 8, 32]
		// CHECK-SAME: (vector<4x2xf16>, vector<4x2xf16>, vector<2x2xf16>) -> vector<2x2xf16>
		%d = nvgpu.mma.sp.sync(%arg0, %arg1, %arg2) metadata(%arg3) {mmaShape = [16, 8, 32]} :
		(vector<4x2xf16>, vector<4x2xf16>, vector<2x2xf16>) -> vector<2x2xf16>
		return %d : vector<2x2xf16>
		}

		// CHECK-LABEL: func @mma_sp_sync_f16_16816(
		func.func @mma_sp_sync_f16_16816(%arg0: vector<2x2xf16>,
		%arg1: vector<2x2xf16>,
		%arg2: vector<2x2xf16>,
		%arg3: vector<2xi16>) -> vector<2x2xf16> {
		// CHECK: nvgpu.mma.sp.sync(%{{.*}}, %{{.*}}, %{{.*}}) metadata(%{{.+}}) {
		// CHECK-SAME: mmaShape = [16, 8, 16]
		// CHECK-SAME: (vector<2x2xf16>, vector<2x2xf16>, vector<2x2xf16>) -> vector<2x2xf16>
		%d = nvgpu.mma.sp.sync(%arg0, %arg1, %arg2) metadata(%arg3) {mmaShape = [16, 8, 16]} :
		(vector<2x2xf16>, vector<2x2xf16>, vector<2x2xf16>) -> vector<2x2xf16>
		return %d : vector<2x2xf16>
		}

		// CHECK-LABEL: func @mma_sp_sync_i8_16864(
		func.func @mma_sp_sync_i8_16864(%arg0: vector<4x4xi8>,
		%arg1: vector<4x4xi8>,
		%arg2: vector<2x2xi32>,
		%arg3: vector<2xi16>) -> vector<2x2xi32> {
		// CHECK: nvgpu.mma.sp.sync(%{{.*}}, %{{.*}}, %{{.*}}) metadata(%{{.+}}) {
		// CHECK-SAME: mmaShape = [16, 8, 64]
		// CHECK-SAME: (vector<4x4xi8>, vector<4x4xi8>, vector<2x2xi32>) -> vector<2x2xi32>
		%d = nvgpu.mma.sp.sync(%arg0, %arg1, %arg2) metadata(%arg3) {mmaShape = [16, 8, 64]} :
		(vector<4x4xi8>, vector<4x4xi8>, vector<2x2xi32>) -> vector<2x2xi32>
		return %d : vector<2x2xi32>
		}

func.func @async_cp(%dst : memref<2x7x5xf32, 3>, %src : memref<4x5xf32>){		func.func @async_cp(%dst : memref<2x7x5xf32, 3>, %src : memref<4x5xf32>){
// CHECK-LABEL: func @async_cp		// CHECK-LABEL: func @async_cp
%c0 = arith.constant 0 : index		%c0 = arith.constant 0 : index
// CHECK: nvgpu.device_async_copy %{{.*}}[{{.*}}, {{.*}}], %{{.*}}[{{.*}}, {{.*}}, {{.*}}], 4 : memref<4x5xf32> to memref<2x7x5xf32, 3>		// CHECK: nvgpu.device_async_copy %{{.*}}[{{.*}}, {{.*}}], %{{.*}}[{{.*}}, {{.*}}, {{.*}}], 4 : memref<4x5xf32> to memref<2x7x5xf32, 3>
%0 = nvgpu.device_async_copy %src[%c0, %c0], %dst[%c0, %c0, %c0], 4 : memref<4x5xf32> to memref<2x7x5xf32, 3>		%0 = nvgpu.device_async_copy %src[%c0, %c0], %dst[%c0, %c0, %c0], 4 : memref<4x5xf32> to memref<2x7x5xf32, 3>
// CHECK: %{{.*}} = nvgpu.device_async_create_group		// CHECK: %{{.*}} = nvgpu.device_async_create_group
%token = nvgpu.device_async_create_group %0		%token = nvgpu.device_async_create_group %0
// CHECK: nvgpu.device_async_wait %{{.*}} {numGroups = 1 : i32}		// CHECK: nvgpu.device_async_wait %{{.*}} {numGroups = 1 : i32}
nvgpu.device_async_wait %token {numGroups = 1 : i32}		nvgpu.device_async_wait %token {numGroups = 1 : i32}
return		return
}		}
