This is an archive of the discontinued LLVM Phabricator instance.


Differential D125244

[mlir][gpu] Move async copy ops to NVGPU and add caching hints
Closed, Public

Authored by ThomasRaoux on May 9 2022, 10:34 AM.

Details

Reviewers

christopherbate
nirvedhmeshram
bondhugula
herhut
ftynse
mravishankar

Commits

rG15bcc36eede1: [mlir][gpu] Move async copy ops to NVGPU and add caching hints

Summary

Move the async copy operations to NVGPU, as they only exist on NVIDIA targets and are designed to match PTX semantics. This also allows us to add a more fine-grained caching hint attribute to the op.
Add a hint to bypass L1 and hook it up to the NVVM op.
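
As a rough illustration of what the lowering has to compute for each copy, the sketch below mirrors the byte-size arithmetic used by the conversion pattern in this diff, `(elementBitWidth / 8) * numElements`, in plain C++. The names `AsyncCopyRequest` and `copySizeInBytes` are hypothetical and are not part of MLIR; the `bypassL1` field only models the optional hint described above, which the backend or hardware may drop.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the attributes of one async copy op (not an MLIR type).
struct AsyncCopyRequest {
  unsigned elementBitWidth; // bit width of the memref element type, e.g. 32 for f32
  int64_t numElements;      // value of the numElements attribute
  bool bypassL1;            // models the optional bypassL1 hint; it does not change the size
};

// Same arithmetic as the lowering in this diff: bytes per element times element count.
// The integer division truncates, so sub-byte element types yield 0 here.
inline int64_t copySizeInBytes(const AsyncCopyRequest &req) {
  assert(req.numElements >= 0 && "element count must be non-negative");
  return (req.elementBitWidth / 8) * req.numElements;
}
```

For instance, four `f32` elements give `copySizeInBytes({32, 4, false}) == 16`, and the result is the same whether or not the hint is set.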

Event Timeline

ThomasRaoux created this revision. May 9 2022, 10:34 AM

Herald added a reviewer: mravishankar. May 9 2022, 10:34 AM

Herald added a reviewer: bondhugula.

Herald added a project: Restricted Project.

Herald added subscribers: mattd, gchakrabarti, sdasgup3 and 24 others.

ThomasRaoux requested review of this revision. May 9 2022, 10:34 AM

Herald added a reviewer: herhut. May 9 2022, 10:34 AM

Herald added a project: Restricted Project.

Herald added subscribers: stephenneuendorffer, nicolasvasilache.

Harbormaster completed remote builds in B163517: Diff 428128. May 9 2022, 10:35 AM

A couple of minor things, otherwise LGTM.

mlir/include/mlir/Dialect/NVGPU/NVGPU.td
Line 172: This would change the diff from being pure code movement, but can't this have `NoSideEffects`?
mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp
Line 160: Is there a source of truth for this in the NVVM dialect? This (renamed) would be useful to have in the dialect header.

rebase and address review comment

Herald added a reviewer: ftynse. May 9 2022, 1:05 PM

Herald added a subscriber: awarzynski.

ThomasRaoux added inline comments. May 9 2022, 1:08 PM

mlir/include/mlir/Dialect/NVGPU/NVGPU.td
Line 172: The problem is that, at this point, we don't want these operations reordered with unrelated commit ops: we don't have code to reorder them correctly when we lower those ops, so we have to rely on operation order. This is something we should improve, but I don't have a good solution at this point.
mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp
Line 160: Good point, moved it there.
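
The ordering concern in the NVGPU.td thread can be illustrated with a toy model. In the lowering shown in this diff, the tokens are replaced by dummy `i32` constants, `device_async_create_group` becomes `NVVM::CpAsyncCommitGroupOp`, and `device_async_wait` becomes `NVVM::CpAsyncWaitGroupOp` with `numGroups` defaulting to 0. The sketch below assumes the usual commit-group behaviour, where a commit captures every copy issued before it and a wait on `N` returns once at most `N` of the most recent groups are still in flight. `AsyncCopyModel` and its methods are hypothetical names for illustration only, not MLIR or NVVM APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Toy model of commit-group based async copies; not MLIR or NVVM code.
class AsyncCopyModel {
public:
  // Issue a copy; it stays pending until the next commit.
  void copy(int id) { pending_.push_back(id); }

  // Commit every copy issued so far into a new group. No token is consulted,
  // which is why the position of the commit in the op order matters.
  void commitGroup() {
    groups_.push_back(pending_);
    pending_.clear();
  }

  // Wait until at most `numGroups` of the most recent groups are in flight.
  // Returns the ids of the copies that are guaranteed complete afterwards.
  std::vector<int> waitGroup(std::size_t numGroups) {
    std::vector<int> completed;
    while (groups_.size() > numGroups) {
      const std::vector<int> &oldest = groups_.front();
      completed.insert(completed.end(), oldest.begin(), oldest.end());
      groups_.pop_front();
    }
    return completed;
  }

private:
  std::vector<int> pending_;            // copies not yet committed to a group
  std::deque<std::vector<int>> groups_; // committed groups, oldest first
};
```

With the order copy 1, copy 2, commit, copy 3, commit, a `waitGroup(1)` completes copies 1 and 2. If the first commit were hoisted above copy 2, the same wait would only guarantee copy 1, even though the tokens in the IR would still claim that copy 2 belongs to the first group.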

Harbormaster completed remote builds in B163558: Diff 428182. May 9 2022, 4:02 PM

mravishankar resigned from this revision. May 10 2022, 9:02 AM

LGTM, thanks!

This revision is now accepted and ready to land. May 10 2022, 2:18 PM

Closed by commit rG15bcc36eede1: [mlir][gpu] Move async copy ops to NVGPU and add caching hints (authored by ThomasRaoux). May 10 2022, 3:30 PM

This revision was automatically updated to reflect the committed changes.

ThomasRaoux added a commit: rG15bcc36eede1: [mlir][gpu] Move async copy ops to NVGPU and add caching hints.

Revision Contents

Path: Size

mlir/include/mlir/Dialect/GPU/GPUBase.td: 7 lines
mlir/include/mlir/Dialect/GPU/GPUDialect.h: 8 lines
mlir/include/mlir/Dialect/GPU/GPUOps.td: 101 lines
mlir/include/mlir/Dialect/NVGPU/NVGPU.td: 127 lines
mlir/include/mlir/Dialect/NVGPU/NVGPUDialect.h: 14 lines
mlir/lib/Conversion/GPUToNVVM/LowerGpuOpsToNVVMOps.cpp: 92 lines
mlir/lib/Conversion/NVGPUToNVVM/CMakeLists.txt: 1 line
mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp: 100 lines
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp: 31 lines
mlir/lib/Dialect/NVGPU/IR/CMakeLists.txt: 1 line
mlir/lib/Dialect/NVGPU/IR/NVGPUDialect.cpp: 61 lines
mlir/test/Conversion/GPUToNVVM/gpu-to-nvvm.mlir: 32 lines
mlir/test/Conversion/NVGPUToNVVM/mma-sync-to-nvvm.mlir (listed with nvgpu-to-nvvm.mlir and mma-sync-to-nvvm.mlir): 37 lines
mlir/test/Dialect/GPU/invalid.mlir: 56 lines
mlir/test/Dialect/GPU/ops.mlir: 12 lines
mlir/test/Dialect/NVGPU/invalid.mlir: 55 lines
mlir/test/Dialect/NVGPU/roundtrip.mlir: 17 lines
utils/bazel/llvm-project-overlay/mlir/BUILD.bazel: 2 lines

Diff 428128

mlir/include/mlir/Dialect/GPU/GPUBase.td

@@ 54 lines not shown @@ def GPU_Dialect : Dialect {
   let dependentDialects = ["arith::ArithmeticDialect"];
   let useDefaultAttributePrinterParser = 1;
 }

 def GPU_AsyncToken : DialectType<
   GPU_Dialect, CPred<"$_self.isa<::mlir::gpu::AsyncTokenType>()">, "async token type">,
   BuildableType<"mlir::gpu::AsyncTokenType::get($_builder.getContext())">;

-/// Device-side synchronization token.
-def GPU_DeviceAsyncToken : DialectType<
-  GPU_Dialect, CPred<"$_self.isa<::mlir::gpu::DeviceAsyncTokenType>()">,
-  "device async token type">,
-  BuildableType<
-    "mlir::gpu::DeviceAsyncTokenType::get($_builder.getContext())">;
-
 // Predicat to check if type is gpu::MMAMatrixType.
 def IsMMAMatrixTypePred : CPred<"$_self.isa<::mlir::gpu::MMAMatrixType>()">;

 def GPU_MMAMatrix : DialectType<
   GPU_Dialect, IsMMAMatrixTypePred, "MMAMatrix type">;

 class MMAMatrixOf<list<Type> allowedTypes> :
   ContainerType<AnyTypeOf<allowedTypes>, IsMMAMatrixTypePred,
@@ 49 lines not shown @@

mlir/include/mlir/Dialect/GPU/GPUDialect.h

@@ 37 lines not shown @@

 class AsyncTokenType
     : public Type::TypeBase<AsyncTokenType, Type, TypeStorage> {
 public:
   // Used for generic hooks in TypeBase.
   using Base::Base;
 };

-/// Device-side token storage type. There is only one type of device-side token.
-class DeviceAsyncTokenType
-    : public Type::TypeBase<DeviceAsyncTokenType, Type, TypeStorage> {
-public:
-  // Used for generic hooks in TypeBase.
-  using Base::Base;
-};
-
 /// MMAMatrixType storage and uniquing. Array is uniqued based on its shape
 /// and type.
 struct MMAMatrixStorageType : public TypeStorage {
   MMAMatrixStorageType(unsigned numDims, const int64_t *dimShapes,
                        Type elementType, StringRef operand)
       : dimShapes(dimShapes), numDims(numDims), elementType(elementType),
         operand(operand) {}

@@ 126 lines not shown @@

mlir/include/mlir/Dialect/GPU/GPUOps.td

@@ 1,274 lines not shown @@ let extraClassDeclaration = [{
     }
   }];

   let assemblyFormat = [{
     $operation $args attr-dict `:` functional-type($args, $res)
   }];
 }

-def GPU_DeviceAsyncCopyOp : GPU_Op<"device_async_copy",
-    [AttrSizedOperandSegments]> {
-  let summary = "device-side asynchronous copy";
-  let description = [{
-    The `gpu.device_async_copy` op initiates an asynchronous copy operation of
-    `$size` elements from source to the destination without blocking the thread.
-    The destination has to be in shared memory.
-
-    This is memory access will be pending to be added to a group.
-
-    This op is meant to be used with `gpu.device_async_create_group` and
-    `gpu.device_async_wait` to synchronize copies as explained in those ops
-    descriptions.
-
-    In order to do a copy and wait for the result we need the following
-    combination:
-    ```
-    // copy 1.
-    %cp1 = gpu.device_async_copy %A[%c0], %B[%c0], 4 :memref<16xf32> to memref<16xf32, 3>
-    // copy 2.
-    %cp2 = gpu.device_async_copy %C[%c0], %D[%c0], 4 : memref<16xf32> to memref<16xf32, 3>
-    // group 1 contains copy 1 and copy 2.
-    %token1 = gpu.device_async_create_group %cp1, %cp2
-    // copy 3.
-    %cp3 = gpu.device_async_copy %E[%c0], %F[%c0], 4 : memref<16xf32> to memref<16xf32, 3>
-    // group 2 contains copy 3.
-    %token2 = gpu.device_async_create_group %cp3
-    // after the wait copy 1 and copy 2 are complete.
-    gpu.device_async_wait %token1
-    // after the wait copy 3 is complete.
-    gpu.device_async_wait %token2
-    ```
-
-    Example:
-
-    ```mlir
-    %0 = gpu.device_async_copy %src[%c0, %c0], %dst[%c0, %c0, %c0], 4 :
-      memref<4x5xf32> to memref<2x7x5xf32, 3>
-    ```
-  }];
-  let results = (outs GPU_DeviceAsyncToken:$asyncToken);
-  let arguments = (ins Arg<AnyMemRef, "", [MemWrite]>:$dst,
-                       Variadic<Index>:$dstIndices,
-                       Arg<AnyMemRef, "", [MemRead]>:$src,
-                       Variadic<Index>:$srcIndices,
-                       IndexAttr:$numElements);
-  let assemblyFormat = [{
-    $src `[` $srcIndices `]` `,` $dst `[` $dstIndices `]` `,` $numElements
-      attr-dict `:` type($src) `to` type($dst)
-  }];
-  let hasVerifier = 1;
-}
-
-def GPU_DeviceAsyncCreateGroupOp : GPU_Op<"device_async_create_group", []> {
-  let summary = "device side asynchronous create group operation";
-  let description = [{
-    The `gpu.device_async_create_group` op creates a group of memory accesses
-    containing all the pending `device_async_copy` operations associated with
-    argument tokens. Each token can only be part of one group.
-
-    It returns a token that can be use to wait until the group fully completes.
-
-    This is meant to be used with `gpu.device_async_wait` to synchronize copies
-    as explained in those ops descriptions.
-
-    Groups are executed in the order they are created.
-
-    Example:
-
-    ```mlir
-    %0 = gpu.device_async_create_group
-    ```
-  }];
-  let results = (outs GPU_DeviceAsyncToken:$asyncToken);
-  let arguments = (ins Variadic<GPU_DeviceAsyncToken>:$inputTokens);
-  let assemblyFormat = [{
-    $inputTokens attr-dict
-  }];
-}
-
-def GPU_DeviceAsyncWaitOp : GPU_Op<"device_async_wait", []> {
-  let summary = "Wait for async gpu ops to complete.";
-  let description = [{
-    The `gpu.device_async_wait` op will block the execution thread until the group
-    associated with the source token is fully completed.
-
-    The optional `$numGroup` attribute gives a lower bound of the number of
-    groups uncompleted when the wait can unblock the thread.
-    Example:
-
-    ```mlir
-    gpu.device_async_wait %0
-    ```
-  }];
-  let arguments = (ins GPU_DeviceAsyncToken:$asyncDependencies,
-                       OptionalAttr<I32Attr>:$numGroups);
-  let assemblyFormat = [{
-    $asyncDependencies attr-dict
-  }];
-}
-
 #endif // GPU_OPS

mlir/include/mlir/Dialect/NVGPU/NVGPU.td

@@ 26 lines not shown @@ def NVGPU_Dialect : Dialect {
   let name = "nvgpu";
   let cppNamespace = "::mlir::nvgpu";
   let description = [{
     This `NVGPU` dialect provides a bridge between the target agnostic GPU and
     Vector dialects and the lower level LLVM IR based NVVM dialect. This allow
     representing PTX specific operations while using MLIR high level concepts
     like memref and 2-D vector.
   }];
+  let useDefaultAttributePrinterParser = 1;
 }

+/// Device-side synchronization token.
+def NVGPU_DeviceAsyncToken : DialectType<
+  NVGPU_Dialect, CPred<"$_self.isa<::mlir::nvgpu::DeviceAsyncTokenType>()">,
+  "device async token type">,
+  BuildableType<
+    "mlir::nvgpu::DeviceAsyncTokenType::get($_builder.getContext())">;
+

 //===----------------------------------------------------------------------===//
 // NVGPU Op definitions
 //===----------------------------------------------------------------------===//

 class NVGPU_Op<string mnemonic, list<Trait> traits = []> :
   Op<NVGPU_Dialect, mnemonic, traits> {}

 def NVGPU_LdMatrixOp : NVGPU_Op<"ldmatrix",
@@ 23 lines not shown @@ let assemblyFormat = [{
     $srcMemref`[` $indices `]` attr-dict `:` type($srcMemref) `->` type($res)
   }];
 }

 def NVGPU_MmaSyncOp : NVGPU_Op<"mma.sync", [NoSideEffect]> {
   let description = [{
     The `nvgpu.mma.sync` op represents the distributed form of a collective
     matrix-multiply-and-accumulate (mma) operation that is compatible with
     `nvvm.mma.sync`. The operands and results are fragments of the full matrix
     operands. The full shape of the distributed mma operation is given by the
     `mmaShape` attribute in the form of a list of dimensions `[m, n, k]`.

     This operation is meant to be lowered to the `nvvm.mma.sync` instruction, and
     is an intermediate point between lowering from `vector.contract` to
     `nvvm.mma.sync`.

     This operation is meant to follow the semantic of described here:
     https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-mma

     Example:

     ```mlir
     nvgpu.mma.sync (%a, %b, %c) :
       (vector<4x2xf16>, vector<2x2xf16>, vector<2x2xf16>) -> vector<2x2xf16>
     ```
   }];
   let arguments = (ins AnyVector:$matrixA, AnyVector:$matrixB,
                        AnyVector:$matrixC, I64ArrayAttr:$mmaShape);

   let results = (outs AnyVector:$res);

   let assemblyFormat = [{
     `(` $matrixA`,` $matrixB`,` $matrixC `)` attr-dict
     `:` `(` type($matrixA) `,` type($matrixB) `,` type($matrixC) `)` `->` type($res)
   }];
 }

+
+def NVGPU_DeviceAsyncCopyOp : NVGPU_Op<"device_async_copy",
+    [AttrSizedOperandSegments]> {
+  let summary = "device-side asynchronous copy";
+  let description = [{
+    The `gpu.device_async_copy` op initiates an asynchronous copy operation of
+    `$size` elements from source to the destination without blocking the thread.
+    The destination has to be in shared memory.
+
+    This is memory access will be pending to be added to a group.
+
+    This op is meant to be used with `gpu.device_async_create_group` and
+    `gpu.device_async_wait` to synchronize copies as explained in those ops
+    descriptions.
+    `bypassL1` attribute is hint to the backend and hardware that
+    the copy should by pass the L1 cache, this may be dropped by the backend or
+    hardware.
+
+    In order to do a copy and wait for the result we need the following
+    combination:
+    ```
+    // copy 1.
+    %cp1 = gpu.device_async_copy %A[%c0], %B[%c0], 4 :memref<16xf32> to memref<16xf32, 3>
+    // copy 2.
+    %cp2 = gpu.device_async_copy %C[%c0], %D[%c0], 4 : memref<16xf32> to memref<16xf32, 3>
+    // group 1 contains copy 1 and copy 2.
+    %token1 = gpu.device_async_create_group %cp1, %cp2
+    // copy 3.
+    %cp3 = gpu.device_async_copy %E[%c0], %F[%c0], 4 : memref<16xf32> to memref<16xf32, 3>
+    // group 2 contains copy 3.
+    %token2 = gpu.device_async_create_group %cp3
+    // after the wait copy 1 and copy 2 are complete.
+    gpu.device_async_wait %token1
+    // after the wait copy 3 is complete.
+    gpu.device_async_wait %token2
+    ```
+
+    Example:
+
+    ```mlir
+    %0 = gpu.device_async_copy %src[%c0, %c0], %dst[%c0, %c0, %c0], 4 :
+      memref<4x5xf32> to memref<2x7x5xf32, 3>
+    ```
+  }];
+  let results = (outs NVGPU_DeviceAsyncToken:$asyncToken);
+  let arguments = (ins Arg<AnyMemRef, "", [MemWrite]>:$dst,
+                       Variadic<Index>:$dstIndices,
+                       Arg<AnyMemRef, "", [MemRead]>:$src,
+                       Variadic<Index>:$srcIndices,
+                       IndexAttr:$numElements,
+                       OptionalAttr<UnitAttr>:$bypassL1);
+  let assemblyFormat = [{
+    $src `[` $srcIndices `]` `,` $dst `[` $dstIndices `]` `,` $numElements
+      attr-dict `:` type($src) `to` type($dst)
+  }];
+  let hasVerifier = 1;
+}
+
+def NVGPU_DeviceAsyncCreateGroupOp : NVGPU_Op<"device_async_create_group", []> {
   [Inline comment] christopherbate: This would change the diff from being pure code movement, but can't this have `NoSideEffects`?
   [Inline comment] ThomasRaoux (author): The problem is that, at this point, we don't want these operations reordered with unrelated commit ops: we don't have code to reorder them correctly when we lower those ops, so we have to rely on operation order. This is something we should improve, but I don't have a good solution at this point.
+  let summary = "device side asynchronous create group operation";
+  let description = [{
+    The `gpu.device_async_create_group` op creates a group of memory accesses
+    containing all the pending `device_async_copy` operations associated with
+    argument tokens. Each token can only be part of one group.
+
+    It returns a token that can be use to wait until the group fully completes.
+
+    This is meant to be used with `gpu.device_async_wait` to synchronize copies
+    as explained in those ops descriptions.
+
+    Groups are executed in the order they are created.
+
+    Example:
+
+    ```mlir
+    %0 = gpu.device_async_create_group
+    ```
+  }];
+  let results = (outs NVGPU_DeviceAsyncToken:$asyncToken);
+  let arguments = (ins Variadic<NVGPU_DeviceAsyncToken>:$inputTokens);
+  let assemblyFormat = [{
+    $inputTokens attr-dict
+  }];
+}
+
+def NVGPU_DeviceAsyncWaitOp : NVGPU_Op<"device_async_wait", []> {
+  let summary = "Wait for async gpu ops to complete.";
+  let description = [{
+    The `gpu.device_async_wait` op will block the execution thread until the group
+    associated with the source token is fully completed.
+
+    The optional `$numGroup` attribute gives a lower bound of the number of
+    groups uncompleted when the wait can unblock the thread.
+    Example:
+
+    ```mlir
+    gpu.device_async_wait %0
+    ```
+  }];
+  let arguments = (ins NVGPU_DeviceAsyncToken:$asyncDependencies,
+                       OptionalAttr<I32Attr>:$numGroups);
+  let assemblyFormat = [{
+    $asyncDependencies attr-dict
+  }];
+}
+
 #endif // NVGPU

mlir/include/mlir/Dialect/NVGPU/NVGPUDialect.h

@@ 12 lines not shown @@
 #ifndef MLIR_DIALECT_NVGPU_NVGPUDIALECT_H_
 #define MLIR_DIALECT_NVGPU_NVGPUDIALECT_H_

 #include "mlir/IR/BuiltinTypes.h"
 #include "mlir/IR/Dialect.h"
 #include "mlir/IR/OpDefinition.h"
 #include "mlir/Interfaces/SideEffectInterfaces.h"

+namespace mlir {
+namespace nvgpu {
+
+/// Device-side token storage type. There is only one type of device-side token.
+class DeviceAsyncTokenType
+    : public Type::TypeBase<DeviceAsyncTokenType, Type, TypeStorage> {
+public:
+  // Used for generic hooks in TypeBase.
+  using Base::Base;
+};
+
+} // namespace nvgpu
+} // namespace mlir
+
 #include "mlir/Dialect/NVGPU/NVGPUDialect.h.inc"

 #define GET_OP_CLASSES
 #include "mlir/Dialect/NVGPU/NVGPU.h.inc"

 #endif // MLIR_DIALECT_NVGPU_NVGPUDIALECT_H_

mlir/lib/Conversion/GPUToNVVM/LowerGpuOpsToNVVMOps.cpp

@@ 36 lines not shown @@
 #include "../GPUCommon/IndexIntrinsicsOpLowering.h"
 #include "../GPUCommon/OpToFuncCallLowering.h"
 #include "../PassDetail.h"

 using namespace mlir;

 namespace {

-/// NVVM memory space identifiers.
-enum NVVMMemorySpace {
-  /// Global memory space identifier.
-  kGlobalMemorySpace = 1,
-  /// Shared memory space identifier.
-  kSharedMemorySpace = 3
-};
-
 /// Convert gpu dialect shfl mode enum to the equivalent nvvm one.
 static NVVM::ShflKind convertShflKind(gpu::ShuffleMode mode) {
   switch (mode) {
   case gpu::ShuffleMode::XOR:
     return NVVM::ShflKind::bfly;
   case gpu::ShuffleMode::UP:
     return NVVM::ShflKind::up;
   case gpu::ShuffleMode::DOWN:
@@ 66 lines not shown @@ matchAndRewrite(gpu::ShuffleOp op, OpAdaptor adaptor,
     Value isActiveSrcLane = rewriter.create<LLVM::ExtractValueOp>(
         loc, predTy, shfl, rewriter.getIndexArrayAttr(1));

     rewriter.replaceOp(op, {shflValue, isActiveSrcLane});
     return success();
   }
 };

-struct GPUAsyncCopyLowering
-    : public ConvertOpToLLVMPattern<gpu::DeviceAsyncCopyOp> {
-  using ConvertOpToLLVMPattern<gpu::DeviceAsyncCopyOp>::ConvertOpToLLVMPattern;
-
-  LogicalResult
-  matchAndRewrite(gpu::DeviceAsyncCopyOp op, OpAdaptor adaptor,
-                  ConversionPatternRewriter &rewriter) const override {
-    Location loc = op->getLoc();
-    auto dstMemrefType = op.dst().getType().cast<MemRefType>();
-    Value dstPtr = getStridedElementPtr(loc, dstMemrefType, adaptor.dst(),
-                                        adaptor.dstIndices(), rewriter);
-    auto i8Ty = IntegerType::get(op.getContext(), 8);
-    auto dstPointerType =
-        LLVM::LLVMPointerType::get(i8Ty, dstMemrefType.getMemorySpaceAsInt());
-    dstPtr = rewriter.create<LLVM::BitcastOp>(loc, dstPointerType, dstPtr);
-
-    auto srcMemrefType = op.src().getType().cast<MemRefType>();
-
-    Value scrPtr = getStridedElementPtr(loc, srcMemrefType, adaptor.src(),
-                                        adaptor.srcIndices(), rewriter);
-    auto srcPointerType =
-        LLVM::LLVMPointerType::get(i8Ty, srcMemrefType.getMemorySpaceAsInt());
-    scrPtr = rewriter.create<LLVM::BitcastOp>(loc, srcPointerType, scrPtr);
-    // Intrinsics takes a global pointer so we need an address space cast.
-    auto srcPointerGlobalType =
-        LLVM::LLVMPointerType::get(i8Ty, NVVMMemorySpace::kGlobalMemorySpace);
-    scrPtr = rewriter.create<LLVM::AddrSpaceCastOp>(loc, srcPointerGlobalType,
-                                                    scrPtr);
-    int64_t numElements = adaptor.numElements().getZExtValue();
-    int64_t sizeInBytes =
-        (dstMemrefType.getElementTypeBitWidth() / 8) * numElements;
-    rewriter.create<NVVM::CpAsyncOp>(loc, dstPtr, scrPtr,
-                                     rewriter.getI32IntegerAttr(sizeInBytes),
-                                     /*bypassL1=*/UnitAttr());
-
-    // Drop the result token.
-    Value zero = rewriter.create<LLVM::ConstantOp>(
-        op->getLoc(), IntegerType::get(op.getContext(), 32),
-        rewriter.getI32IntegerAttr(0));
-    rewriter.replaceOp(op, zero);
-    return success();
-  }
-};
-
-struct GPUAsyncCreateGroupLowering
-    : public ConvertOpToLLVMPattern<gpu::DeviceAsyncCreateGroupOp> {
-  using ConvertOpToLLVMPattern<
-      gpu::DeviceAsyncCreateGroupOp>::ConvertOpToLLVMPattern;
-
-  LogicalResult
-  matchAndRewrite(gpu::DeviceAsyncCreateGroupOp op, OpAdaptor adaptor,
-                  ConversionPatternRewriter &rewriter) const override {
-    rewriter.create<NVVM::CpAsyncCommitGroupOp>(op.getLoc());
-    // Drop the result token.
-    Value zero = rewriter.create<LLVM::ConstantOp>(
-        op->getLoc(), IntegerType::get(op.getContext(), 32),
-        rewriter.getI32IntegerAttr(0));
-    rewriter.replaceOp(op, zero);
-    return success();
-  }
-};
-
-struct GPUAsyncWaitLowering
-    : public ConvertOpToLLVMPattern<gpu::DeviceAsyncWaitOp> {
-  using ConvertOpToLLVMPattern<gpu::DeviceAsyncWaitOp>::ConvertOpToLLVMPattern;
-
-  LogicalResult
-  matchAndRewrite(gpu::DeviceAsyncWaitOp op, OpAdaptor adaptor,
-                  ConversionPatternRewriter &rewriter) const override {
-    // If numGroup is not present pick 0 as a conservative correct value.
-    int32_t numGroups = adaptor.numGroups() ? *adaptor.numGroups() : 0;
-    rewriter.create<NVVM::CpAsyncWaitGroupOp>(op.getLoc(), numGroups);
-    rewriter.eraseOp(op);
-    return success();
-  }
-};
-
 struct GPULaneIdOpToNVVM : ConvertOpToLLVMPattern<gpu::LaneIdOp> {
   using ConvertOpToLLVMPattern<gpu::LaneIdOp>::ConvertOpToLLVMPattern;

   LogicalResult
   matchAndRewrite(gpu::LaneIdOp op, gpu::LaneIdOp::Adaptor adaptor,
                   ConversionPatternRewriter &rewriter) const override {
     auto loc = op->getLoc();
     MLIRContext *context = rewriter.getContext();
@@ 45 lines not shown @@ void runOnOperation() override {
     /// converter drops the private memory space to support the use case above.
     LLVMTypeConverter converter(m.getContext(), options);
     converter.addConversion([&](MemRefType type) -> Optional<Type> {
       if (type.getMemorySpaceAsInt() !=
           gpu::GPUDialect::getPrivateAddressSpace())
         return llvm::None;
       return converter.convertType(MemRefType::Builder(type).setMemorySpace(0));
     });
-    /// device-side async tokens cannot be materialized in nvvm. We just convert
-    /// them to a dummy i32 type in order to easily drop them during conversion.
-    converter.addConversion([&](gpu::DeviceAsyncTokenType type) -> Type {
-      return converter.convertType(IntegerType::get(type.getContext(), 32));
-    });
     // Lowering for MMAMatrixType.
     converter.addConversion([&](gpu::MMAMatrixType type) -> Type {
       return convertMMAToLLVMType(type);
     });
     RewritePatternSet patterns(m.getContext());
     RewritePatternSet llvmPatterns(m.getContext());

     // Apply in-dialect lowering first. In-dialect lowering will replace ops
@@ 84 lines not shown @@ void mlir::populateGpuToNVVMConversionPatterns(LLVMTypeConverter &converter,
   patterns.add<OpToFuncCallLowering<math::RsqrtOp>>(converter, "__nv_rsqrtf",
                                                     "__nv_rsqrt");
   patterns.add<OpToFuncCallLowering<math::SinOp>>(converter, "__nv_sinf",
                                                   "__nv_sin");
   patterns.add<OpToFuncCallLowering<math::SqrtOp>>(converter, "__nv_sqrtf",
                                                    "__nv_sqrt");
   patterns.add<OpToFuncCallLowering<math::TanhOp>>(converter, "__nv_tanhf",
                                                    "__nv_tanh");
-  patterns.add<GPUAsyncCopyLowering, GPUAsyncCreateGroupLowering,
-               GPUAsyncWaitLowering>(converter);
 }

 std::unique_ptr<OperationPass<gpu::GPUModuleOp>>
 mlir::createLowerGpuOpsToNVVMOpsPass(unsigned indexBitwidth) {
   return std::make_unique<LowerGpuOpsToNVVMOpsPass>(indexBitwidth);
 }

mlir/lib/Conversion/NVGPUToNVVM/CMakeLists.txt

 add_mlir_conversion_library(MLIRNVGPUToNVVM
   NVGPUToNVVM.cpp

   ADDITIONAL_HEADER_DIRS
   ${MLIR_MAIN_INCLUDE_DIR}/mlir/Conversion/NVGPUToNVVM

   DEPENDS
   MLIRConversionPassIncGen

   LINK_COMPONENTS
   Core

   LINK_LIBS PUBLIC
+  MLIRGPUOps
   MLIRLLVMCommonConversion
   MLIRLLVMIR
   MLIRNVVMIR
   MLIRNVGPU
   MLIRPass
   MLIRTransforms
   )

mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp

//===- NVGPUToNVVM.cpp - NVGPU to NVVM dialect conversion -----------------===//		//===- NVGPUToNVVM.cpp - NVGPU to NVVM dialect conversion -----------------===//
//		//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.		// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.		// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception		// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#include "mlir/Conversion/NVGPUToNVVM/NVGPUToNVVM.h"		#include "mlir/Conversion/NVGPUToNVVM/NVGPUToNVVM.h"
#include "../PassDetail.h"		#include "../PassDetail.h"
#include "mlir/Conversion/LLVMCommon/ConversionTarget.h"		#include "mlir/Conversion/LLVMCommon/ConversionTarget.h"
#include "mlir/Conversion/LLVMCommon/Pattern.h"		#include "mlir/Conversion/LLVMCommon/Pattern.h"
		#include "mlir/Dialect/GPU/GPUDialect.h"
#include "mlir/Dialect/LLVMIR/NVVMDialect.h"		#include "mlir/Dialect/LLVMIR/NVVMDialect.h"
#include "mlir/Dialect/NVGPU/NVGPUDialect.h"		#include "mlir/Dialect/NVGPU/NVGPUDialect.h"

using namespace mlir;		using namespace mlir;

/// Returns the type for the intrinsic given the vectorResultType of the		/// Returns the type for the intrinsic given the vectorResultType of the
/// `gpu.mma.sync` operation.		/// `gpu.mma.sync` operation.
static Type inferIntrinsicResultType(Type vectorResultType) {		static Type inferIntrinsicResultType(Type vectorResultType) {
▲ Show 20 Lines • Show All 129 Lines • ▼ Show 20 Lines	for (unsigned i = 0, e = arrayTy.getNumElements(); i < e; ++i) {
}		}
result.push_back(toUse);		result.push_back(toUse);
}		}
return result;		return result;
}		}

namespace {		namespace {

		/// NVVM memory space identifiers.
		enum NVVMMemorySpace {
		christopherbate: Is there a source of truth for this in NVVM dialect? This (renamed) would be useful to have in the dialect header.
		ThomasRaoux (author): good point, moved it there
		/// Global memory space identifier.
		kGlobalMemorySpace = 1,
		/// Shared memory space identifier.
		kSharedMemorySpace = 3
		};

struct MmaLdMatrixOpToNVVM : public ConvertOpToLLVMPattern<nvgpu::LdMatrixOp> {		struct MmaLdMatrixOpToNVVM : public ConvertOpToLLVMPattern<nvgpu::LdMatrixOp> {
using ConvertOpToLLVMPattern<nvgpu::LdMatrixOp>::ConvertOpToLLVMPattern;		using ConvertOpToLLVMPattern<nvgpu::LdMatrixOp>::ConvertOpToLLVMPattern;

LogicalResult		LogicalResult
matchAndRewrite(nvgpu::LdMatrixOp op, OpAdaptor adaptor,		matchAndRewrite(nvgpu::LdMatrixOp op, OpAdaptor adaptor,
ConversionPatternRewriter &rewriter) const override {		ConversionPatternRewriter &rewriter) const override {
MLIRContext *ctx = getContext();		MLIRContext *ctx = getContext();
Location loc = op->getLoc();		Location loc = op->getLoc();
▲ Show 20 Lines • Show All 116 Lines • ▼ Show 20 Lines

struct ConvertNVGPUToNVVMPass		struct ConvertNVGPUToNVVMPass
: public ConvertNVGPUToNVVMBase<ConvertNVGPUToNVVMPass> {		: public ConvertNVGPUToNVVMBase<ConvertNVGPUToNVVMPass> {
ConvertNVGPUToNVVMPass() = default;		ConvertNVGPUToNVVMPass() = default;

void runOnOperation() override {		void runOnOperation() override {
RewritePatternSet patterns(&getContext());		RewritePatternSet patterns(&getContext());
LLVMTypeConverter converter(&getContext());		LLVMTypeConverter converter(&getContext());
		/// device-side async tokens cannot be materialized in nvvm. We just convert
		/// them to a dummy i32 type in order to easily drop them during conversion.
		converter.addConversion([&](nvgpu::DeviceAsyncTokenType type) -> Type {
		return converter.convertType(IntegerType::get(type.getContext(), 32));
		});
populateNVGPUToNVVMConversionPatterns(converter, patterns);		populateNVGPUToNVVMConversionPatterns(converter, patterns);
LLVMConversionTarget target(getContext());		LLVMConversionTarget target(getContext());
target.addLegalDialect<::mlir::LLVM::LLVMDialect>();		target.addLegalDialect<::mlir::LLVM::LLVMDialect>();
target.addLegalDialect<::mlir::NVVM::NVVMDialect>();		target.addLegalDialect<::mlir::NVVM::NVVMDialect>();
if (failed(applyPartialConversion(getOperation(), target,		if (failed(applyPartialConversion(getOperation(), target,
std::move(patterns))))		std::move(patterns))))
signalPassFailure();		signalPassFailure();
}		}
};		};

		struct NVGPUAsyncCopyLowering
		: public ConvertOpToLLVMPattern<nvgpu::DeviceAsyncCopyOp> {
		using ConvertOpToLLVMPattern<
		nvgpu::DeviceAsyncCopyOp>::ConvertOpToLLVMPattern;

		LogicalResult
		matchAndRewrite(nvgpu::DeviceAsyncCopyOp op, OpAdaptor adaptor,
		ConversionPatternRewriter &rewriter) const override {
		Location loc = op->getLoc();
		auto dstMemrefType = op.dst().getType().cast<MemRefType>();
		Value dstPtr = getStridedElementPtr(loc, dstMemrefType, adaptor.dst(),
		adaptor.dstIndices(), rewriter);
		auto i8Ty = IntegerType::get(op.getContext(), 8);
		auto dstPointerType =
		LLVM::LLVMPointerType::get(i8Ty, dstMemrefType.getMemorySpaceAsInt());
		dstPtr = rewriter.create<LLVM::BitcastOp>(loc, dstPointerType, dstPtr);

		auto srcMemrefType = op.src().getType().cast<MemRefType>();

		Value scrPtr = getStridedElementPtr(loc, srcMemrefType, adaptor.src(),
		adaptor.srcIndices(), rewriter);
		auto srcPointerType =
		LLVM::LLVMPointerType::get(i8Ty, srcMemrefType.getMemorySpaceAsInt());
		scrPtr = rewriter.create<LLVM::BitcastOp>(loc, srcPointerType, scrPtr);
		// The intrinsic takes a global pointer, so we need an address space cast.
		auto srcPointerGlobalType =
		LLVM::LLVMPointerType::get(i8Ty, NVVMMemorySpace::kGlobalMemorySpace);
		scrPtr = rewriter.create<LLVM::AddrSpaceCastOp>(loc, srcPointerGlobalType,
		scrPtr);
		int64_t numElements = adaptor.numElements().getZExtValue();
		int64_t sizeInBytes =
		(dstMemrefType.getElementTypeBitWidth() / 8) * numElements;
		// bypass L1 is only supported for byte sizes of 16, we drop the hint
		// otherwise.
		UnitAttr bypassL1 = sizeInBytes == 16 ? adaptor.bypassL1Attr() : UnitAttr();
		rewriter.create<NVVM::CpAsyncOp>(
		loc, dstPtr, scrPtr, rewriter.getI32IntegerAttr(sizeInBytes), bypassL1);

		// Drop the result token.
		Value zero = rewriter.create<LLVM::ConstantOp>(
		op->getLoc(), IntegerType::get(op.getContext(), 32),
		rewriter.getI32IntegerAttr(0));
		rewriter.replaceOp(op, zero);
		return success();
		}
		};

		struct NVGPUAsyncCreateGroupLowering
		: public ConvertOpToLLVMPattern<nvgpu::DeviceAsyncCreateGroupOp> {
		using ConvertOpToLLVMPattern<
		nvgpu::DeviceAsyncCreateGroupOp>::ConvertOpToLLVMPattern;

		LogicalResult
		matchAndRewrite(nvgpu::DeviceAsyncCreateGroupOp op, OpAdaptor adaptor,
		ConversionPatternRewriter &rewriter) const override {
		rewriter.create<NVVM::CpAsyncCommitGroupOp>(op.getLoc());
		// Drop the result token.
		Value zero = rewriter.create<LLVM::ConstantOp>(
		op->getLoc(), IntegerType::get(op.getContext(), 32),
		rewriter.getI32IntegerAttr(0));
		rewriter.replaceOp(op, zero);
		return success();
		}
		};

		struct NVGPUAsyncWaitLowering
		: public ConvertOpToLLVMPattern<nvgpu::DeviceAsyncWaitOp> {
		using ConvertOpToLLVMPattern<
		nvgpu::DeviceAsyncWaitOp>::ConvertOpToLLVMPattern;

		LogicalResult
		matchAndRewrite(nvgpu::DeviceAsyncWaitOp op, OpAdaptor adaptor,
		ConversionPatternRewriter &rewriter) const override {
		// If numGroups is not present, pick 0 as a conservatively correct value.
		int32_t numGroups = adaptor.numGroups() ? *adaptor.numGroups() : 0;
		rewriter.create<NVVM::CpAsyncWaitGroupOp>(op.getLoc(), numGroups);
		rewriter.eraseOp(op);
		return success();
		}
		};

} // namespace		} // namespace

void mlir::populateNVGPUToNVVMConversionPatterns(LLVMTypeConverter &converter,		void mlir::populateNVGPUToNVVMConversionPatterns(LLVMTypeConverter &converter,
RewritePatternSet &patterns) {		RewritePatternSet &patterns) {
patterns.add<MmaSyncOptoNVVM, MmaLdMatrixOpToNVVM>(converter);		patterns.add<MmaSyncOptoNVVM, MmaLdMatrixOpToNVVM, NVGPUAsyncCopyLowering,
		NVGPUAsyncCreateGroupLowering, NVGPUAsyncWaitLowering>(
		converter);
}		}

std::unique_ptr<Pass> mlir::createConvertNVGPUToNVVMPass() {		std::unique_ptr<Pass> mlir::createConvertNVGPUToNVVMPass() {
return std::make_unique<ConvertNVGPUToNVVMPass>();		return std::make_unique<ConvertNVGPUToNVVMPass>();
}		}
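
For readers skimming the three new lowering patterns, the arithmetic they rely on can be sketched outside of MLIR. This is only an illustration under assumptions: the helper names `copySizeInBytes`, `keepBypassL1` and `waitGroupCount` are hypothetical and do not exist in the patch; they mirror the `sizeInBytes`, `bypassL1` and `numGroups` computations shown above.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical stand-in for the size computation in NVGPUAsyncCopyLowering:
// bytes copied = (element bit width / 8) * number of elements.
// The integer division truncates, as in the pattern above.
constexpr int64_t copySizeInBytes(unsigned elementBitWidth,
                                  int64_t numElements) {
  return (elementBitWidth / 8) * numElements;
}

// The bypass-L1 hint is only forwarded when exactly 16 bytes are copied;
// for any other size the hint is dropped.
constexpr bool keepBypassL1(bool hintRequested, int64_t sizeInBytes) {
  return hintRequested && sizeInBytes == 16;
}

// A missing numGroups attribute falls back to 0, i.e. wait for all groups.
constexpr int32_t waitGroupCount(std::optional<int32_t> numGroups) {
  return numGroups.value_or(0);
}

// Four f32 elements give the 16-byte copy seen in the conversion test.
static_assert(copySizeInBytes(32, 4) == 16, "4 x f32 is 16 bytes");
static_assert(keepBypassL1(true, copySizeInBytes(32, 4)), "hint is kept");
static_assert(!keepBypassL1(true, copySizeInBytes(32, 2)), "hint is dropped");
```

With these values, a copy of four `f32` elements keeps the hint, while a copy of two elements (8 bytes) would silently lose it.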

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp

Show First 20 Lines • Show All 111 Lines • ▼ Show 20 Lines	bool isLegalToInline(Operation *, Region *, bool,
BlockAndValueMapping &) const final {		BlockAndValueMapping &) const final {
return true;		return true;
}		}
};		};
} // namespace		} // namespace

void GPUDialect::initialize() {		void GPUDialect::initialize() {
addTypes<AsyncTokenType>();		addTypes<AsyncTokenType>();
addTypes<DeviceAsyncTokenType>();
addTypes<MMAMatrixType>();		addTypes<MMAMatrixType>();
addOperations<		addOperations<
#define GET_OP_LIST		#define GET_OP_LIST
#include "mlir/Dialect/GPU/GPUOps.cpp.inc"		#include "mlir/Dialect/GPU/GPUOps.cpp.inc"
>();		>();
addAttributes<		addAttributes<
#define GET_ATTRDEF_LIST		#define GET_ATTRDEF_LIST
#include "mlir/Dialect/GPU/GPUOpsAttributes.cpp.inc"		#include "mlir/Dialect/GPU/GPUOpsAttributes.cpp.inc"
>();		>();
addInterfaces<GPUInlinerInterface>();		addInterfaces<GPUInlinerInterface>();
}		}

Type GPUDialect::parseType(DialectAsmParser &parser) const {		Type GPUDialect::parseType(DialectAsmParser &parser) const {
// Parse the main keyword for the type.		// Parse the main keyword for the type.
StringRef keyword;		StringRef keyword;
if (parser.parseKeyword(&keyword))		if (parser.parseKeyword(&keyword))
return Type();		return Type();
MLIRContext *context = getContext();		MLIRContext *context = getContext();

// Handle 'async token' types.		// Handle 'async token' types.
if (keyword == "async.token")		if (keyword == "async.token")
return AsyncTokenType::get(context);		return AsyncTokenType::get(context);
// Handle 'device async token' types.
if (keyword == "device.async.token")
return DeviceAsyncTokenType::get(context);

if (keyword == "mma_matrix") {		if (keyword == "mma_matrix") {
SMLoc beginLoc = parser.getNameLoc();		SMLoc beginLoc = parser.getNameLoc();

// Parse '<'.		// Parse '<'.
if (parser.parseLess())		if (parser.parseLess())
return nullptr;		return nullptr;

Show All 24 Lines	Type GPUDialect::parseType(DialectAsmParser &parser) const {

parser.emitError(parser.getNameLoc(), "unknown gpu type: " + keyword);		parser.emitError(parser.getNameLoc(), "unknown gpu type: " + keyword);
return Type();		return Type();
}		}

void GPUDialect::printType(Type type, DialectAsmPrinter &os) const {		void GPUDialect::printType(Type type, DialectAsmPrinter &os) const {
TypeSwitch<Type>(type)		TypeSwitch<Type>(type)
.Case<AsyncTokenType>([&](Type) { os << "async.token"; })		.Case<AsyncTokenType>([&](Type) { os << "async.token"; })
.Case<DeviceAsyncTokenType>([&](Type) { os << "device.async.token"; })
.Case<MMAMatrixType>([&](MMAMatrixType fragTy) {		.Case<MMAMatrixType>([&](MMAMatrixType fragTy) {
os << "mma_matrix<";		os << "mma_matrix<";
auto shape = fragTy.getShape();		auto shape = fragTy.getShape();
for (auto dim = shape.begin(), e = shape.end() - 1; dim != e; ++dim)		for (auto dim = shape.begin(), e = shape.end() - 1; dim != e; ++dim)
os << *dim << 'x';		os << *dim << 'x';
os << shape.back() << 'x' << fragTy.getElementType();		os << shape.back() << 'x' << fragTy.getElementType();
os << ", \"" << fragTy.getOperand() << "\"" << '>';		os << ", \"" << fragTy.getOperand() << "\"" << '>';
})		})
▲ Show 20 Lines • Show All 1,166 Lines • ▼ Show 20 Lines

} // namespace		} // namespace

void AllocOp::getCanonicalizationPatterns(RewritePatternSet &results,		void AllocOp::getCanonicalizationPatterns(RewritePatternSet &results,
MLIRContext *context) {		MLIRContext *context) {
results.add<SimplifyDimOfAllocOp>(context);		results.add<SimplifyDimOfAllocOp>(context);
}		}

//===----------------------------------------------------------------------===//
// GPU_DeviceAsyncCopyOp
//===----------------------------------------------------------------------===//

LogicalResult DeviceAsyncCopyOp::verify() {
auto srcMemref = src().getType().cast<MemRefType>();
auto dstMemref = dst().getType().cast<MemRefType>();
unsigned workgroupAddressSpace = GPUDialect::getWorkgroupAddressSpace();
if (!isLastMemrefDimUnitStride(srcMemref))
return emitError("source memref most minor dim must have unit stride");
if (!isLastMemrefDimUnitStride(dstMemref))
return emitError("destination memref most minor dim must have unit stride");
if (dstMemref.getMemorySpaceAsInt() != workgroupAddressSpace)
return emitError("destination memref must have memory space ")
<< workgroupAddressSpace;
if (dstMemref.getElementType() != srcMemref.getElementType())
return emitError("source and destination must have the same element type");
if (size_t(srcMemref.getRank()) != srcIndices().size())
return emitOpError() << "expected " << srcMemref.getRank()
<< " source indices, got " << srcIndices().size();
if (size_t(dstMemref.getRank()) != dstIndices().size())
return emitOpError() << "expected " << dstMemref.getRank()
<< " destination indices, got " << dstIndices().size();
return success();
}

#include "mlir/Dialect/GPU/GPUOpInterfaces.cpp.inc"		#include "mlir/Dialect/GPU/GPUOpInterfaces.cpp.inc"
#include "mlir/Dialect/GPU/GPUOpsEnums.cpp.inc"		#include "mlir/Dialect/GPU/GPUOpsEnums.cpp.inc"

#define GET_ATTRDEF_CLASSES		#define GET_ATTRDEF_CLASSES
#include "mlir/Dialect/GPU/GPUOpsAttributes.cpp.inc"		#include "mlir/Dialect/GPU/GPUOpsAttributes.cpp.inc"

#define GET_OP_CLASSES		#define GET_OP_CLASSES
#include "mlir/Dialect/GPU/GPUOps.cpp.inc"		#include "mlir/Dialect/GPU/GPUOps.cpp.inc"

mlir/lib/Dialect/NVGPU/IR/CMakeLists.txt

	add_mlir_dialect_library(MLIRNVGPU			add_mlir_dialect_library(MLIRNVGPU
	NVGPUDialect.cpp			NVGPUDialect.cpp

	ADDITIONAL_HEADER_DIRS			ADDITIONAL_HEADER_DIRS
	${MLIR_MAIN_INCLUDE_DIR}/mlir/Dialect/NVGPU			${MLIR_MAIN_INCLUDE_DIR}/mlir/Dialect/NVGPU

	DEPENDS			DEPENDS
	MLIRNVGPUIncGen			MLIRNVGPUIncGen

	LINK_LIBS PUBLIC			LINK_LIBS PUBLIC
				MLIRGPUOps
	MLIRIR			MLIRIR
	MLIRSideEffectInterfaces			MLIRSideEffectInterfaces
	)			)

mlir/lib/Dialect/NVGPU/IR/NVGPUDialect.cpp

	//===- NVGPUDialect.cpp - MLIR NVGPU ops implementation -------------------===//			//===- NVGPUDialect.cpp - MLIR NVGPU ops implementation -------------------===//
	//			//
	// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.			// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
	// See https://llvm.org/LICENSE.txt for license information.			// See https://llvm.org/LICENSE.txt for license information.
	// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception			// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	//			//
	// This file implements the NVGPU dialect and its operations.			// This file implements the NVGPU dialect and its operations.
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	#include "mlir/Dialect/NVGPU/NVGPUDialect.h"			#include "mlir/Dialect/NVGPU/NVGPUDialect.h"
				#include "mlir/Dialect/GPU/GPUDialect.h"
	#include "mlir/IR/Builders.h"			#include "mlir/IR/Builders.h"
				#include "mlir/IR/DialectImplementation.h"
	#include "mlir/IR/OpImplementation.h"			#include "mlir/IR/OpImplementation.h"
	#include "mlir/IR/TypeUtilities.h"			#include "mlir/IR/TypeUtilities.h"
				#include "llvm/ADT/TypeSwitch.h"

	using namespace mlir;			using namespace mlir;
				using namespace mlir::nvgpu;

	#include "mlir/Dialect/NVGPU/NVGPUDialect.cpp.inc"			#include "mlir/Dialect/NVGPU/NVGPUDialect.cpp.inc"

	void nvgpu::NVGPUDialect::initialize() {			void nvgpu::NVGPUDialect::initialize() {
				addTypes<DeviceAsyncTokenType>();
	addOperations<			addOperations<
	#define GET_OP_LIST			#define GET_OP_LIST
	#include "mlir/Dialect/NVGPU/NVGPU.cpp.inc"			#include "mlir/Dialect/NVGPU/NVGPU.cpp.inc"
	>();			>();
	}			}

				Type NVGPUDialect::parseType(DialectAsmParser &parser) const {
				// Parse the main keyword for the type.
				StringRef keyword;
				if (parser.parseKeyword(&keyword))
				return Type();
				MLIRContext *context = getContext();
				// Handle 'device async token' types.
				if (keyword == "device.async.token")
				return DeviceAsyncTokenType::get(context);

				parser.emitError(parser.getNameLoc(), "unknown nvgpu type: " + keyword);
				return Type();
				}

				void NVGPUDialect::printType(Type type, DialectAsmPrinter &os) const {
				TypeSwitch<Type>(type)
				.Case<DeviceAsyncTokenType>([&](Type) { os << "device.async.token"; })
				.Default([](Type) { llvm_unreachable("unexpected 'nvgpu' type kind"); });
				}
				//===----------------------------------------------------------------------===//
				// NVGPU_DeviceAsyncCopyOp
				//===----------------------------------------------------------------------===//

				/// Return true if the last dimension of the MemRefType has unit stride. Also
				/// return true for memrefs with no strides.
				static bool isLastMemrefDimUnitStride(MemRefType type) {
				int64_t offset;
				SmallVector<int64_t> strides;
				if (failed(getStridesAndOffset(type, strides, offset))) {
				return false;
				}
				return strides.back() == 1;
				}

				LogicalResult DeviceAsyncCopyOp::verify() {
				auto srcMemref = src().getType().cast<MemRefType>();
				auto dstMemref = dst().getType().cast<MemRefType>();
				unsigned workgroupAddressSpace = gpu::GPUDialect::getWorkgroupAddressSpace();
				if (!isLastMemrefDimUnitStride(srcMemref))
				return emitError("source memref most minor dim must have unit stride");
				if (!isLastMemrefDimUnitStride(dstMemref))
				return emitError("destination memref most minor dim must have unit stride");
				if (dstMemref.getMemorySpaceAsInt() != workgroupAddressSpace)
				return emitError("destination memref must have memory space ")
				<< workgroupAddressSpace;
				if (dstMemref.getElementType() != srcMemref.getElementType())
				return emitError("source and destination must have the same element type");
				if (size_t(srcMemref.getRank()) != srcIndices().size())
				return emitOpError() << "expected " << srcMemref.getRank()
				<< " source indices, got " << srcIndices().size();
				if (size_t(dstMemref.getRank()) != dstIndices().size())
				return emitOpError() << "expected " << dstMemref.getRank()
				<< " destination indices, got " << dstIndices().size();
				return success();
				}

	#define GET_OP_CLASSES			#define GET_OP_CLASSES
	#include "mlir/Dialect/NVGPU/NVGPU.cpp.inc"			#include "mlir/Dialect/NVGPU/NVGPU.cpp.inc"
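
The unit-stride requirement checked by `isLastMemrefDimUnitStride` can be illustrated without MLIR types. The sketch below is an assumption-laden stand-in: `rowMajorStrides` and `lastDimHasUnitStride` are hypothetical helpers, strides are plain integer vectors rather than the result of `getStridesAndOffset`, and the explicit empty-vector guard is a choice made here to match the documented "no strides" behaviour.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Strides of a statically shaped, identity-layout (row-major) memref.
// For the shape {3, 16, 128} this yields {2048, 128, 1}.
std::vector<int64_t> rowMajorStrides(const std::vector<int64_t> &shape) {
  std::vector<int64_t> strides(shape.size(), 1);
  // Walk from the innermost dimension outwards, accumulating the product.
  for (std::size_t i = shape.size(); i-- > 1;)
    strides[i - 1] = strides[i] * shape[i];
  return strides;
}

// True when the most minor dimension is contiguous. An empty stride list
// is treated as valid here, guarded explicitly to avoid reading back()
// of an empty vector.
bool lastDimHasUnitStride(const std::vector<int64_t> &strides) {
  return strides.empty() || strides.back() == 1;
}
```

Under these assumptions, a layout with strides `{200, 2}` fails the check, which corresponds to the "most minor dim must have unit stride" diagnostics emitted by the verifier.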

mlir/test/Conversion/GPUToNVVM/gpu-to-nvvm.mlir

Show First 20 Lines • Show All 482 Lines • ▼ Show 20 Lines	gpu.module @test_module {
// CHECK: attributes		// CHECK: attributes
// CHECK: gpu.kernel		// CHECK: gpu.kernel
// CHECK: nvvm.kernel		// CHECK: nvvm.kernel
gpu.func @kernel_func() kernel {		gpu.func @kernel_func() kernel {
gpu.return		gpu.return
}		}
}		}

// -----

gpu.module @test_module {
// CHECK-LABEL: @async_cp(
// CHECK-SAME: %[[IDX:[a-zA-Z0-9_]+]]: i64)
gpu.func @async_cp(
%src: memref<128x128xf32>, %dst: memref<3x16x128xf32, 3>, %i : index) kernel {
// CHECK-DAG: %[[BASEDST:.*]] = llvm.extractvalue %{{.*}}[1] : !llvm.struct<(ptr<f32, 3>, ptr<f32, 3>, i64, array<3 x i64>, array<3 x i64>)>
// CHECK-DAG: %[[S0:.*]] = llvm.mlir.constant(2048 : index) : i64
// CHECK-DAG: %[[LI:.*]] = llvm.mul %[[IDX]], %[[S0]] : i64
// CHECK-DAG: %[[S1:.*]] = llvm.mlir.constant(128 : index) : i64
// CHECK-DAG: %[[FI0:.*]] = llvm.mul %[[IDX]], %[[S1]] : i64
// CHECK-DAG: %[[FI1:.*]] = llvm.add %[[LI]], %[[FI0]] : i64
// CHECK-DAG: %[[FI2:.*]] = llvm.add %[[FI1]], %[[IDX]] : i64
// CHECK-DAG: %[[ADDRESSDST:.*]] = llvm.getelementptr %[[BASEDST]][%[[FI2]]] : (!llvm.ptr<f32, 3>, i64) -> !llvm.ptr<f32, 3>
// CHECK-DAG: %[[CAST0:.*]] = llvm.bitcast %[[ADDRESSDST]] : !llvm.ptr<f32, 3> to !llvm.ptr<i8, 3>
// CHECK-DAG: %[[BASESRC:.*]] = llvm.extractvalue %{{.*}}[1] : !llvm.struct<(ptr<f32>, ptr<f32>, i64, array<2 x i64>, array<2 x i64>)>
// CHECK-DAG: %[[S3:.*]] = llvm.mlir.constant(128 : index) : i64
// CHECK-DAG: %[[FI3:.*]] = llvm.mul %[[IDX]], %[[S3]] : i64
// CHECK-DAG: %[[FI4:.*]] = llvm.add %[[FI3]], %[[IDX]] : i64
// CHECK-DAG: %[[ADDRESSSRC:.*]] = llvm.getelementptr %[[BASESRC]][%[[FI4]]] : (!llvm.ptr<f32>, i64) -> !llvm.ptr<f32>
// CHECK-DAG: %[[CAST1:.*]] = llvm.bitcast %[[ADDRESSSRC]] : !llvm.ptr<f32> to !llvm.ptr<i8>
// CHECK-DAG: %[[CAST2:.*]] = llvm.addrspacecast %[[CAST1]] : !llvm.ptr<i8> to !llvm.ptr<i8, 1>
// CHECK-DAG: nvvm.cp.async.shared.global %[[CAST0]], %[[CAST2]], 16
%0 = gpu.device_async_copy %src[%i, %i], %dst[%i, %i, %i], 4 : memref<128x128xf32> to memref<3x16x128xf32, 3>
// CHECK: nvvm.cp.async.commit.group
%1 = gpu.device_async_create_group %0
// CHECK: nvvm.cp.async.wait.group 1
gpu.device_async_wait %1 { numGroups = 1 : i32 }
gpu.return
}
}

mlir/test/Conversion/NVGPUToNVVM/mma-sync-to-nvvm.mlir

This file was moved to mlir/test/Conversion/NVGPUToNVVM/nvgpu-to-nvvm.mlir.

mlir/test/Conversion/NVGPUToNVVM/nvgpu-to-nvvm.mlir

This file was moved from mlir/test/Conversion/NVGPUToNVVM/mma-sync-to-nvvm.mlir.

	Show First 20 Lines • Show All 119 Lines • ▼ Show 20 Lines
	func.func @ldmatrix_x1(%arg0: memref<128x128xf16, 3>) -> vector<1x2xf16> {			func.func @ldmatrix_x1(%arg0: memref<128x128xf16, 3>) -> vector<1x2xf16> {
	%c0 = arith.constant 0 : index			%c0 = arith.constant 0 : index
	// CHECK: nvvm.ldmatrix {{%.+}} {layout = #nvvm.mma_layout<row>, num = 1 : i32} {{.*}} -> i32			// CHECK: nvvm.ldmatrix {{%.+}} {layout = #nvvm.mma_layout<row>, num = 1 : i32} {{.*}} -> i32
	%a = nvgpu.ldmatrix %arg0[%c0, %c0] {transpose = false, numTiles = 1 : i32} : memref<128x128xf16, 3> -> vector<1x2xf16>			%a = nvgpu.ldmatrix %arg0[%c0, %c0] {transpose = false, numTiles = 1 : i32} : memref<128x128xf16, 3> -> vector<1x2xf16>
	// CHECK: llvm.bitcast			// CHECK: llvm.bitcast
	// CHECK: llvm.insertvalue			// CHECK: llvm.insertvalue
	return %a : vector<1x2xf16>			return %a : vector<1x2xf16>
	}			}


				// -----

				// CHECK-LABEL: @async_cp(
				// CHECK-SAME: %[[IDX:[a-zA-Z0-9_]+]]: index)
				func.func @async_cp(
				%src: memref<128x128xf32>, %dst: memref<3x16x128xf32, 3>, %i : index) {
				// CHECK: %[[IDX1:.*]] = builtin.unrealized_conversion_cast %[[IDX]] : index to i64
				// CHECK-DAG: %[[BASEDST:.*]] = llvm.extractvalue %{{.*}}[1] : !llvm.struct<(ptr<f32, 3>, ptr<f32, 3>, i64, array<3 x i64>, array<3 x i64>)>
				// CHECK-DAG: %[[S0:.*]] = llvm.mlir.constant(2048 : index) : i64
				// CHECK-DAG: %[[LI:.*]] = llvm.mul %[[IDX1]], %[[S0]] : i64
				// CHECK-DAG: %[[S1:.*]] = llvm.mlir.constant(128 : index) : i64
				// CHECK-DAG: %[[FI0:.*]] = llvm.mul %[[IDX1]], %[[S1]] : i64
				// CHECK-DAG: %[[FI1:.*]] = llvm.add %[[LI]], %[[FI0]] : i64
				// CHECK-DAG: %[[FI2:.*]] = llvm.add %[[FI1]], %[[IDX1]] : i64
				// CHECK-DAG: %[[ADDRESSDST:.*]] = llvm.getelementptr %[[BASEDST]][%[[FI2]]] : (!llvm.ptr<f32, 3>, i64) -> !llvm.ptr<f32, 3>
				// CHECK-DAG: %[[CAST0:.*]] = llvm.bitcast %[[ADDRESSDST]] : !llvm.ptr<f32, 3> to !llvm.ptr<i8, 3>
				// CHECK-DAG: %[[BASESRC:.*]] = llvm.extractvalue %{{.*}}[1] : !llvm.struct<(ptr<f32>, ptr<f32>, i64, array<2 x i64>, array<2 x i64>)>
				// CHECK-DAG: %[[S3:.*]] = llvm.mlir.constant(128 : index) : i64
				// CHECK-DAG: %[[FI3:.*]] = llvm.mul %[[IDX1]], %[[S3]] : i64
				// CHECK-DAG: %[[FI4:.*]] = llvm.add %[[FI3]], %[[IDX1]] : i64
				// CHECK-DAG: %[[ADDRESSSRC:.*]] = llvm.getelementptr %[[BASESRC]][%[[FI4]]] : (!llvm.ptr<f32>, i64) -> !llvm.ptr<f32>
				// CHECK-DAG: %[[CAST1:.*]] = llvm.bitcast %[[ADDRESSSRC]] : !llvm.ptr<f32> to !llvm.ptr<i8>
				// CHECK-DAG: %[[CAST2:.*]] = llvm.addrspacecast %[[CAST1]] : !llvm.ptr<i8> to !llvm.ptr<i8, 1>
				// CHECK-DAG: nvvm.cp.async.shared.global %[[CAST0]], %[[CAST2]], 16
				%0 = nvgpu.device_async_copy %src[%i, %i], %dst[%i, %i, %i], 4 : memref<128x128xf32> to memref<3x16x128xf32, 3>
				// CHECK: nvvm.cp.async.commit.group
				%1 = nvgpu.device_async_create_group %0
				// CHECK: nvvm.cp.async.wait.group 1
				nvgpu.device_async_wait %1 { numGroups = 1 : i32 }

				// CHECK: nvvm.cp.async.shared.global %{{.*}}, %{{.*}}, 16 {bypass_l1}
				%2 = nvgpu.device_async_copy %src[%i, %i], %dst[%i, %i, %i], 4 {bypassL1}: memref<128x128xf32> to memref<3x16x128xf32, 3>
				return
				}

mlir/test/Dialect/GPU/invalid.mlir

	Show First 20 Lines • Show All 553 Lines • ▼ Show 20 Lines
	func.func @wmmaMmaOp_invalid_operand_shapes(%A : !gpu.mma_matrix<16x32xf16, "AOp">, %B : !gpu.mma_matrix<16x16xf16, "BOp">, %C : !gpu.mma_matrix<16x16xf16, "COp">) -> () {			func.func @wmmaMmaOp_invalid_operand_shapes(%A : !gpu.mma_matrix<16x32xf16, "AOp">, %B : !gpu.mma_matrix<16x16xf16, "BOp">, %C : !gpu.mma_matrix<16x16xf16, "COp">) -> () {
	// expected-error @+1 {{operand shapes do not satisfy matmul constraints}}			// expected-error @+1 {{operand shapes do not satisfy matmul constraints}}
	%D = gpu.subgroup_mma_compute %A, %B, %C : !gpu.mma_matrix<16x32xf16, "AOp">, !gpu.mma_matrix<16x16xf16, "BOp"> -> !gpu.mma_matrix<16x16xf16, "COp">			%D = gpu.subgroup_mma_compute %A, %B, %C : !gpu.mma_matrix<16x32xf16, "AOp">, !gpu.mma_matrix<16x16xf16, "BOp"> -> !gpu.mma_matrix<16x16xf16, "COp">
	return			return
	}			}

	// -----			// -----

	func.func @async_cp_memory_space(%dst : memref<16xf32>, %src : memref<16xf32>, %i : index) -> () {
	// expected-error @+1 {{destination memref must have memory space 3}}
	gpu.device_async_copy %src[%i], %dst[%i], 16 : memref<16xf32> to memref<16xf32>
	return
	}

	// -----

	func.func @async_cp_memref_type(%dst : memref<16xi32, 3>, %src : memref<16xf32>, %i : index) -> () {
	// expected-error @+1 {{source and destination must have the same element type}}
	gpu.device_async_copy %src[%i], %dst[%i], 16 : memref<16xf32> to memref<16xi32, 3>
	return
	}

	// -----

	func.func @async_cp_num_src_indices(%dst : memref<16xf32, 3>, %src : memref<16x16xf32>, %i : index) -> () {
	// expected-error @+1 {{expected 2 source indices, got 1}}
	gpu.device_async_copy %src[%i], %dst[%i], 16 : memref<16x16xf32> to memref<16xf32, 3>
	return
	}

	// -----

	func.func @async_cp_num_dst_indices(%dst : memref<16x16xf32, 3>, %src : memref<16xf32>, %i : index) -> () {
	// expected-error @+1 {{expected 2 destination indices, got 1}}
	gpu.device_async_copy %src[%i], %dst[%i], 16 : memref<16xf32> to memref<16x16xf32, 3>
	return
	}

	// -----

	func.func @async_cp_num_src_stride(
	%dst : memref<200x100xf32, 3>,
	%src : memref<200x100xf32, affine_map<(d0, d1) -> (200*d0 + 2*d1)>>,
	%i : index) -> () {
	// expected-error @+1 {{source memref most minor dim must have unit stride}}
	gpu.device_async_copy %src[%i, %i], %dst[%i, %i], 16 :
	memref<200x100xf32, affine_map<(d0, d1) -> (200*d0 + 2*d1)>> to memref<200x100xf32, 3>
	return
	}

	// -----

	func.func @async_cp_num_dst_stride(
	%dst : memref<200x100xf32, affine_map<(d0, d1) -> (200*d0 + 2*d1)>, 3>,
	%src : memref<200x100xf32>,
	%i : index) -> () {
	// expected-error @+1 {{destination memref most minor dim must have unit stride}}
	gpu.device_async_copy %src[%i, %i], %dst[%i, %i], 16 :
	memref<200x100xf32> to memref<200x100xf32, affine_map<(d0, d1) -> (200*d0 + 2*d1)>, 3>
	return
	}

	// -----

	// Number of symbol operand count less than memref symbol count.			// Number of symbol operand count less than memref symbol count.
	func.func @alloc() {			func.func @alloc() {
	// expected-error@+1 {{symbol operand count does not equal memref symbol count}}			// expected-error@+1 {{symbol operand count does not equal memref symbol count}}
	%1 = gpu.alloc() : memref<2x4xf32, affine_map<(d0, d1)[s0] -> ((d0 + s0), d1)>, 1>			%1 = gpu.alloc() : memref<2x4xf32, affine_map<(d0, d1)[s0] -> ((d0 + s0), d1)>, 1>
	return			return
	}			}

	// -----			// -----
	Show All 28 Lines

mlir/test/Dialect/GPU/ops.mlir

Show First 20 Lines • Show All 271 Lines • ▼ Show 20 Lines	func.func @mmamatrix_valid_element_type(%src : memref<32x32xf16, affine_map<(d0, d1) -> (d0 * 64 + d1)>>){
%1 = gpu.subgroup_mma_constant_matrix %cst : !gpu.mma_matrix<16x16xf32, "COp">		%1 = gpu.subgroup_mma_constant_matrix %cst : !gpu.mma_matrix<16x16xf32, "COp">
// CHECK: gpu.subgroup_mma_elementwise addf %{{.*}}, %{{.*}} : (!gpu.mma_matrix<16x16xf32, "COp">, !gpu.mma_matrix<16x16xf32, "COp">) -> !gpu.mma_matrix<16x16xf32, "COp">		// CHECK: gpu.subgroup_mma_elementwise addf %{{.*}}, %{{.*}} : (!gpu.mma_matrix<16x16xf32, "COp">, !gpu.mma_matrix<16x16xf32, "COp">) -> !gpu.mma_matrix<16x16xf32, "COp">
%2 = gpu.subgroup_mma_elementwise addf %1, %1 : (!gpu.mma_matrix<16x16xf32, "COp">, !gpu.mma_matrix<16x16xf32, "COp">) -> !gpu.mma_matrix<16x16xf32, "COp">		%2 = gpu.subgroup_mma_elementwise addf %1, %1 : (!gpu.mma_matrix<16x16xf32, "COp">, !gpu.mma_matrix<16x16xf32, "COp">) -> !gpu.mma_matrix<16x16xf32, "COp">
// CHECK: gpu.subgroup_mma_elementwise maxf %{{.*}}, %{{.*}} : (!gpu.mma_matrix<16x16xf32, "COp">, !gpu.mma_matrix<16x16xf32, "COp">) -> !gpu.mma_matrix<16x16xf32, "COp">		// CHECK: gpu.subgroup_mma_elementwise maxf %{{.*}}, %{{.*}} : (!gpu.mma_matrix<16x16xf32, "COp">, !gpu.mma_matrix<16x16xf32, "COp">) -> !gpu.mma_matrix<16x16xf32, "COp">
%3 = gpu.subgroup_mma_elementwise maxf %2, %1 : (!gpu.mma_matrix<16x16xf32, "COp">, !gpu.mma_matrix<16x16xf32, "COp">) -> !gpu.mma_matrix<16x16xf32, "COp">		%3 = gpu.subgroup_mma_elementwise maxf %2, %1 : (!gpu.mma_matrix<16x16xf32, "COp">, !gpu.mma_matrix<16x16xf32, "COp">) -> !gpu.mma_matrix<16x16xf32, "COp">
return		return
}		}

func.func @async_cp(%dst : memref<2x7x5xf32, 3>, %src : memref<4x5xf32>){
// CHECK-LABEL: func @async_cp
%c0 = arith.constant 0 : index
// CHECK: gpu.device_async_copy %{{.*}}[{{.*}}, {{.*}}], %{{.*}}[{{.*}}, {{.*}}, {{.*}}], 4 : memref<4x5xf32> to memref<2x7x5xf32, 3>
%0 = gpu.device_async_copy %src[%c0, %c0], %dst[%c0, %c0, %c0], 4 : memref<4x5xf32> to memref<2x7x5xf32, 3>
// CHECK: %{{.*}} = gpu.device_async_create_group
%token = gpu.device_async_create_group %0
// CHECK: gpu.device_async_wait %{{.*}} {numGroups = 1 : i32}
gpu.device_async_wait %token {numGroups = 1 : i32}
return
}

// CHECK-LABEL: func @set_default_device		// CHECK-LABEL: func @set_default_device
func.func @set_default_device(%arg0: i32) {		func.func @set_default_device(%arg0: i32) {
// CHECK: gpu.set_default_device		// CHECK: gpu.set_default_device
gpu.set_default_device %arg0		gpu.set_default_device %arg0
return		return
}		}
}		}

mlir/test/Dialect/NVGPU/invalid.mlir

This file was added.

				// RUN: mlir-opt -split-input-file -verify-diagnostics %s

				func.func @async_cp_memory_space(%dst : memref<16xf32>, %src : memref<16xf32>, %i : index) -> () {
				// expected-error @+1 {{destination memref must have memory space 3}}
				nvgpu.device_async_copy %src[%i], %dst[%i], 16 : memref<16xf32> to memref<16xf32>
				return
				}

				// -----

				func.func @async_cp_memref_type(%dst : memref<16xi32, 3>, %src : memref<16xf32>, %i : index) -> () {
				// expected-error @+1 {{source and destination must have the same element type}}
				nvgpu.device_async_copy %src[%i], %dst[%i], 16 : memref<16xf32> to memref<16xi32, 3>
				return
				}

				// -----

				func.func @async_cp_num_src_indices(%dst : memref<16xf32, 3>, %src : memref<16x16xf32>, %i : index) -> () {
				// expected-error @+1 {{expected 2 source indices, got 1}}
				nvgpu.device_async_copy %src[%i], %dst[%i], 16 : memref<16x16xf32> to memref<16xf32, 3>
				return
				}

				// -----

				func.func @async_cp_num_dst_indices(%dst : memref<16x16xf32, 3>, %src : memref<16xf32>, %i : index) -> () {
				// expected-error @+1 {{expected 2 destination indices, got 1}}
				nvgpu.device_async_copy %src[%i], %dst[%i], 16 : memref<16xf32> to memref<16x16xf32, 3>
				return
				}

				// -----

				func.func @async_cp_num_src_stride(
				%dst : memref<200x100xf32, 3>,
				%src : memref<200x100xf32, affine_map<(d0, d1) -> (200*d0 + 2*d1)>>,
				%i : index) -> () {
				// expected-error @+1 {{source memref most minor dim must have unit stride}}
				nvgpu.device_async_copy %src[%i, %i], %dst[%i, %i], 16 :
				memref<200x100xf32, affine_map<(d0, d1) -> (200*d0 + 2*d1)>> to memref<200x100xf32, 3>
				return
				}

				// -----

				func.func @async_cp_num_dst_stride(
				%dst : memref<200x100xf32, affine_map<(d0, d1) -> (200*d0 + 2*d1)>, 3>,
				%src : memref<200x100xf32>,
				%i : index) -> () {
				// expected-error @+1 {{destination memref most minor dim must have unit stride}}
				nvgpu.device_async_copy %src[%i, %i], %dst[%i, %i], 16 :
				memref<200x100xf32> to memref<200x100xf32, affine_map<(d0, d1) -> (200*d0 + 2*d1)>, 3>
				return
				}
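
The negative tests above each trigger one branch of `DeviceAsyncCopyOp::verify()`, and the order of the checks decides which diagnostic is reported first. A rough model of that ordering, assuming a simplified operand description, is sketched below; `MemRefInfo` and `verifyAsyncCopy` are hypothetical names used only for illustration, and the workgroup address space is taken to be 3 as in the expected diagnostics.

```cpp
#include <cstdint>
#include <string>

// Hypothetical, simplified description of a memref operand.
struct MemRefInfo {
  std::string elementType; // e.g. "f32"
  int64_t rank;            // number of dimensions
  unsigned memorySpace;    // 0 = default, 3 = workgroup (shared)
  bool lastDimUnitStride;  // most minor dimension is contiguous
};

// Returns the first diagnostic the verifier would emit, or an empty string
// if all checks pass. The checks run in the same order as in the patch.
std::string verifyAsyncCopy(const MemRefInfo &src, int64_t numSrcIndices,
                            const MemRefInfo &dst, int64_t numDstIndices) {
  constexpr unsigned kWorkgroupAddressSpace = 3; // assumed value
  if (!src.lastDimUnitStride)
    return "source memref most minor dim must have unit stride";
  if (!dst.lastDimUnitStride)
    return "destination memref most minor dim must have unit stride";
  if (dst.memorySpace != kWorkgroupAddressSpace)
    return "destination memref must have memory space " +
           std::to_string(kWorkgroupAddressSpace);
  if (dst.elementType != src.elementType)
    return "source and destination must have the same element type";
  if (src.rank != numSrcIndices)
    return "expected " + std::to_string(src.rank) + " source indices, got " +
           std::to_string(numSrcIndices);
  if (dst.rank != numDstIndices)
    return "expected " + std::to_string(dst.rank) +
           " destination indices, got " + std::to_string(numDstIndices);
  return "";
}
```

Because the stride checks come first, an operand that violates several constraints at once would only report the stride error in this model.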

mlir/test/Dialect/NVGPU/roundtrip.mlir

	// RUN: mlir-opt %s | mlir-opt | FileCheck %s			// RUN: mlir-opt %s | mlir-opt | FileCheck %s

	// CHECK-LABEL: func @ldmatrix(			// CHECK-LABEL: func @ldmatrix(
	func.func @ldmatrix(%arg0: memref<?x?xf16, 3>, %x: index, %y: index) {			func.func @ldmatrix(%arg0: memref<?x?xf16, 3>, %x: index, %y: index) {
	// CHECK: nvgpu.ldmatrix %{{.*}}[%{{.*}}, %{{.*}}]			// CHECK: nvgpu.ldmatrix %{{.*}}[%{{.*}}, %{{.*}}]
	// CHECK-SAME: {numTiles = 4 : i32, transpose = false} : memref<?x?xf16, 3> -> vector<4x2xf16>			// CHECK-SAME: {numTiles = 4 : i32, transpose = false} : memref<?x?xf16, 3> -> vector<4x2xf16>
	%l = nvgpu.ldmatrix %arg0[%x, %y] {numTiles = 4 : i32, transpose = false} :			%l = nvgpu.ldmatrix %arg0[%x, %y] {numTiles = 4 : i32, transpose = false} :
	memref<?x?xf16, 3> -> vector<4x2xf16>			memref<?x?xf16, 3> -> vector<4x2xf16>
	return			return
	}			}

	// CHECK-LABEL: func @mma_sync(
	func.func @mma_sync(%arg0: vector<4x2xf16>,
	%arg1: vector<2x2xf16>,
	%arg2: vector<2x2xf16>) -> vector<2x2xf16> {
	// CHECK: nvgpu.mma.sync(%{{.*}}, %{{.*}}, %{{.*}}) {mmaShape = [16, 8, 16]} : (vector<4x2xf16>, vector<2x2xf16>, vector<2x2xf16>) -> vector<2x2xf16>
	%d = nvgpu.mma.sync(%arg0, %arg1, %arg2) {mmaShape = [16, 8, 16]} :
	(vector<4x2xf16>, vector<2x2xf16>, vector<2x2xf16>) -> vector<2x2xf16>
	return %d : vector<2x2xf16>
	}


				func.func @async_cp(%dst : memref<2x7x5xf32, 3>, %src : memref<4x5xf32>){
				// CHECK-LABEL: func @async_cp
				%c0 = arith.constant 0 : index
				// CHECK: nvgpu.device_async_copy %{{.*}}[{{.*}}, {{.*}}], %{{.*}}[{{.*}}, {{.*}}, {{.*}}], 4 : memref<4x5xf32> to memref<2x7x5xf32, 3>
				%0 = nvgpu.device_async_copy %src[%c0, %c0], %dst[%c0, %c0, %c0], 4 : memref<4x5xf32> to memref<2x7x5xf32, 3>
				// CHECK: %{{.*}} = nvgpu.device_async_create_group
				%token = nvgpu.device_async_create_group %0
				// CHECK: nvgpu.device_async_wait %{{.*}} {numGroups = 1 : i32}
				nvgpu.device_async_wait %token {numGroups = 1 : i32}
				return
				}

utils/bazel/llvm-project-overlay/mlir/BUILD.bazel

...
)

cc_library(
name = "NVGPU",
srcs = ["lib/Dialect/NVGPU/IR/NVGPUDialect.cpp"],
hdrs = ["include/mlir/Dialect/NVGPU/NVGPUDialect.h"],
includes = ["include"],
deps = [
		":GPUDialect",
":IR",
":NVGPUIncGen",
":SideEffectInterfaces",
"//llvm:Core",
"//llvm:Support",
],
)

...
srcs = glob([
"lib/Conversion/NVGPUToNVVM/*.h",
]) + [":ConversionPassDetail"],
hdrs = glob([
"include/mlir/Conversion/NVGPUToNVVM/*.h",
]),
includes = ["include"],
deps = [
":ConversionPassIncGen",
		":GPUDialect",
":IR",
":LLVMCommonConversion",
":NVGPU",
":NVVMDialect",
":Pass",
":Transforms",
"//llvm:Support",
],
...

Revision Contents

Diff 428128
