This transform looks for suitable vector transfers from global memory to shared memory and converts them to async device copies.
Diff Detail
- Repository: rG LLVM Github Monorepo
Event Timeline
| mlir/include/mlir/Dialect/NVGPU/TransformOps/NVGPUTransformOps.td | |
|---|---|
| 32 | at the what? |
| 34–35 | Why not call it bypass_l1 then? |
| 39 | It feels like a footgun: the target may not itself be erased, but nested ops definitely are, and we are not invalidating handles to those. I'd rather make this consume the target and produce a new handle for chaining (see the sketch below). |
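On the handle-invalidation comment at line 39: a minimal sketch of what consuming the target and producing a chainable result could look like on the C++ side, assuming the op gains a result handle; the op and accessor names (CreateAsyncGroupsOp, getTarget(), getResult()) are illustrative rather than taken from the patch.

```cpp
#include "mlir/Dialect/Transform/IR/TransformInterfaces.h"

// Hypothetical side-effect declaration: the incoming handle is consumed
// (its payload, including nested ops, may be rewritten or erased), and a
// fresh handle is produced so the result can be chained.
void transform::CreateAsyncGroupsOp::getEffects(
    SmallVectorImpl<MemoryEffects::EffectInstance> &effects) {
  consumesHandle(getTarget(), effects);  // old handle becomes invalid
  producesHandle(getResult(), effects);  // new handle for chaining
  modifiesPayload(effects);              // payload IR is modified in place
}
```
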
| mlir/lib/Dialect/NVGPU/Transforms/CreateAsyncGroups.cpp | |
|---|---|
| 26–30 | Nit: can this be extracted into a template and made common with the similar code below? (See the sketch after this table.) |
| 53 | Nit: could this elaborate on what happens with 2d masks right now? |
| 60 | Nit: "legal" is a bit too generic here. I'd rather say something like "currently supported by async copy". |
| 96 | Nit: I believe walk can take a lambda returning void, so we don't need WalkResult::advance() everywhere (see the sketch after this table). |
| 98 | Nit: no need to prefix with llvm::, here and below. |
| 191 | For the future: this should really be using the data layout mechanism, which would also fix the TODO related to alignment above (see the sketch after this table). |
| 192–193 | I'd consider returning a (silenceable) failure if useMMASync is set but cannot be honored, or at least documenting this attribute as a hint rather than a guarantee (see the sketch after this table). |
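On the template-extraction nit at lines 26–30: a hedged sketch of the kind of shared helper that could cover both the read and the write case; the helper name and the exact predicate are assumptions, not the actual checks in the patch.

```cpp
// Hypothetical shared predicate: vector.transfer_read and
// vector.transfer_write expose the same accessors, so a single template
// can replace the two near-identical checks.
template <typename TransferOpTy>
static bool isContiguousTransfer(TransferOpTy xferOp) {
  return isa<MemRefType>(xferOp.getShapedType()) &&  // memref source/dest
         xferOp.getVectorType().getRank() == 1 &&    // 1-D transfer
         xferOp.getPermutationMap().isMinorIdentity() &&
         !xferOp.hasOutOfBoundsDim();
}
```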
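On the walk nit at line 96: Operation::walk accepts a callback returning void, in which case no WalkResult plumbing is needed; WalkResult is only required when the walk should be able to interrupt or skip. A small sketch with illustrative variable names:

```cpp
// Collect candidate transfers with a void-returning callback; no
// WalkResult::advance() is needed at the end of the lambda.
SmallVector<Operation *> copyCandidates;
funcOp->walk([&](vector::TransferReadOp readOp) {
  copyCandidates.push_back(readOp);
});
```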
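On the data layout remark at line 191: a rough sketch of what a data-layout-based query could look like, which would also answer the alignment TODO; transferOp and vecType are placeholders for whatever the surrounding code has in scope.

```cpp
#include "mlir/Interfaces/DataLayoutInterfaces.h"

// Ask the closest enclosing data layout scope for size and alignment
// instead of hard-coding bit widths.
DataLayout dataLayout = DataLayout::closest(transferOp);
auto elementBits = dataLayout.getTypeSizeInBits(vecType.getElementType());
auto elementAlign = dataLayout.getTypeABIAlignment(vecType.getElementType());
```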
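On the useMMASync comment at lines 192–193: a hedged sketch of the silenceable-failure option inside the transform op's apply logic; getUseMmaSync() and supportsMmaSync() are hypothetical names used only for illustration.

```cpp
// Report a silenceable failure when the requested mma_sync form cannot be
// produced, instead of silently falling back to the default copy.
if (getUseMmaSync() && !supportsMmaSync(transferOp)) {  // hypothetical helper
  return emitSilenceableError()
         << "useMMASync was requested but this transfer cannot be lowered "
            "to an mma_sync-compatible async copy";
}
```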