Adds an optional attribute to support tensor cores on the F32 datatype by lowering to mma.sync with TF32 operands. Since TF32 is not a native datatype in LLVM, we add tf32Enabled as an attribute so the IR is aware of the MmaSyncOp datatype. Additionally, this patch adds placeholders for an nvgpu-to-nvgpu transformation targeting the higher-precision tf32x3 mode.
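For illustration, a minimal sketch of how the attribute could appear on nvgpu.mma.sync in the IR. The m16n8k8 shape and per-thread fragment vector types are assumptions for the example, not prescribed by this patch:

```mlir
func.func @mma_sync_tf32(%a: vector<4x1xf32>, %b: vector<2x1xf32>,
                         %c: vector<2x2xf32>) -> vector<2x2xf32> {
  // The unit attribute tf32Enabled marks this op for lowering to
  // mma.sync with tf32 operands; omitting it keeps the plain f32 path.
  %d = nvgpu.mma.sync(%a, %b, %c) {mmaShape = [16, 8, 8], tf32Enabled}
      : (vector<4x1xf32>, vector<2x1xf32>, vector<2x2xf32>) -> vector<2x2xf32>
  return %d : vector<2x2xf32>
}
```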
For mma.sync on f32 inputs using tensor cores, there are two possibilities:
(a) tf32 (1 mma.sync per warp-level matrix-multiply-accumulate)
(b) tf32x3 (3 mma.sync per warp-level matrix-multiply-accumulate)
Typically, tf32 tensor core acceleration comes at a cost in accuracy from the missing precision bits: while f32 has 23 precision bits, tf32 has only 10. tf32x3 aims to recover the lost precision by splitting each operand into two tf32 values and issuing three mma.sync tensor core operations, as sketched below.
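A rough sketch of the tf32x3 decomposition (notation is mine, not from this patch): each f32 operand is split into a "big" tf32 part and a "small" tf32 correction, and the product is approximated with three tensor core multiplies, dropping the small-by-small term:

```latex
a_{\mathrm{big}} = \mathrm{tf32}(a), \quad a_{\mathrm{small}} = \mathrm{tf32}(a - a_{\mathrm{big}})
b_{\mathrm{big}} = \mathrm{tf32}(b), \quad b_{\mathrm{small}} = \mathrm{tf32}(b - b_{\mathrm{big}})
a \cdot b \approx \underbrace{a_{\mathrm{small}} \cdot b_{\mathrm{big}}}_{\text{mma.sync 1}}
              + \underbrace{a_{\mathrm{big}} \cdot b_{\mathrm{small}}}_{\text{mma.sync 2}}
              + \underbrace{a_{\mathrm{big}} \cdot b_{\mathrm{big}}}_{\text{mma.sync 3}}
```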
I think just putting `UnitAttr` without the `OptionalAttr` should be sufficient.