This is an archive of the discontinued LLVM Phabricator instance.

[mlir][nvgpu] Add `tma.create.descriptor` to create tensor map descriptor
ClosedPublic

Authored by guraypp on Jul 19 2023, 1:24 AM.

Details

Summary

The Op creates a tensor map descriptor object representing a tiled memory region. The descriptor is used by Tensor Memory Access (TMA). The `tensor` is the source tensor to be tiled. The `boxDimensions` attribute gives the size of the tiled memory region in each dimension.

The pattern here lowers `tma.create.descriptor` to a runtime function call that eventually calls the CUDA Driver's `cuTensorMapEncodeTiled`. For more information, see:
https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TENSOR__MEMORY.html
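For illustration, such a wrapper boils down to a call like the following sketch. The wrapper name, rank, data type, and the interleave/swizzle/L2/fill choices are assumptions made for the example, not the exact code in this patch:

```cpp
#include <cuda.h>
#include <cstdint>

// Sketch of a runtime wrapper that encodes a 2-D tiled tensor map for TMA.
// The real wrapper in CudaRuntimeWrappers.cpp may differ in name and arguments;
// data type, rank, and the interleave/swizzle/L2/fill modes are fixed here for brevity.
extern "C" CUresult sketchEncodeTensorMap2D(CUtensorMap *map, void *globalAddress,
                                            uint64_t dim0, uint64_t dim1,
                                            uint64_t outerStrideBytes,
                                            uint32_t boxDim0, uint32_t boxDim1) {
  cuuint64_t globalDim[2] = {dim0, dim1};
  cuuint64_t globalStrides[1] = {outerStrideBytes}; // rank-1 strides; innermost dim contiguous
  cuuint32_t boxDim[2] = {boxDim0, boxDim1};        // size of the tiled region per dimension
  cuuint32_t elementStrides[2] = {1, 1};
  return cuTensorMapEncodeTiled(
      map, CU_TENSOR_MAP_DATA_TYPE_FLOAT32, /*tensorRank=*/2, globalAddress,
      globalDim, globalStrides, boxDim, elementStrides,
      CU_TENSOR_MAP_INTERLEAVE_NONE, CU_TENSOR_MAP_SWIZZLE_NONE,
      CU_TENSOR_MAP_L2_PROMOTION_NONE, CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);
}
```

The encoded descriptor then has to live somewhere the kernel can read it, which is what the discussion about managed versus explicitly copied memory below is about.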

Depends on D155453

Diff Detail

Event Timeline

guraypp created this revision.Jul 19 2023, 1:24 AM
Herald added a reviewer: dcaballe.
Herald added a project: Restricted Project.
guraypp requested review of this revision.Jul 19 2023, 1:24 AM

Generally looks good, thanks for adding this!

Qq on the dependency on gpu.alloc_managed.
It seems to me the managed alloc is hidden within the tma.create_descriptor op and you would only need a tma.destroy_descriptor to free the underlying memory?

Do you anticipate actually needing this gpu.alloc_managed op for other use cases?

Qq on the dependency on gpu.alloc_managed.
It seems to me the managed alloc is hidden within the tma.create_descriptor op

I hid it for the sake of simplicity. I can also call cuMemAlloc + cuMemcpy; I can do that to break the dependency.
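For reference, that driver-API alternative is roughly the following sketch (error handling omitted; `hostDesc` stands for an already-encoded CUtensorMap on the host):

```cpp
// Sketch: place the encoded descriptor in device memory without managed allocations.
CUdeviceptr devDesc;
cuMemAlloc(&devDesc, sizeof(CUtensorMap));              // plain device allocation
cuMemcpyHtoD(devDesc, &hostDesc, sizeof(CUtensorMap));  // copy the host-encoded descriptor
```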

and you would only need a tma.destroy_descriptor to free the underlying memory?

That's right, there is a need for a tma.destroy_descriptor Op with or without managed memory. Thanks for bringing that up!

Do you anticipate actually needing this gpu.alloc_managed op for other use cases?

Managed memory is easy to use, and I like it when writing IR by hand. It performs as well as a manual synchronous copy if the program copies data in at the beginning and copies it back at the end, because NVIDIA GPUs have page migration: the hardware keeps the pages resident and caches them in TLBs.
It can cause thrashing when there are copies between kernels. It's possible to alleviate this with cudaMemAdvise, and cudaMemPrefetchAsync can be used to reach the same performance as a manual asynchronous copy.
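For illustration, those two hints look roughly like this (a sketch against the CUDA runtime API; the device id, stream, and sizes are placeholders):

```cpp
#include <cuda_runtime.h>

// Sketch: reduce managed-memory thrashing between kernel launches.
void adviseAndPrefetch(void *managedPtr, size_t bytes, int device,
                       cudaStream_t stream) {
  // Prefer keeping the pages resident on the GPU that runs the kernels.
  cudaMemAdvise(managedPtr, bytes, cudaMemAdviseSetPreferredLocation, device);
  // Migrate the pages ahead of the launch, similar to a manual asynchronous copy.
  cudaMemPrefetchAsync(managedPtr, bytes, device, stream);
}
```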

guraypp updated this revision to Diff 542499.Jul 20 2023, 6:48 AM

break dependency on managed memory

nicolasvasilache added inline comments.
mlir/include/mlir/Conversion/GPUCommon/GPUCommonPass.h
53

Some minor doc comment plz.

mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp
953

seems this makeConst and the one in the following function could be factored out as a

static Value makeI64Const(...)
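A sketch of what that shared helper could look like against the MLIR C++ API (assuming the existing includes of NVGPUToNVVM.cpp; the final patch may spell it differently):

```cpp
// Sketch: shared helper for emitting i64 constants in the NVGPU-to-NVVM patterns.
static Value makeI64Const(ConversionPatternRewriter &rewriter, Location loc,
                          int64_t value) {
  return rewriter.create<LLVM::ConstantOp>(loc, rewriter.getIntegerType(64),
                                           rewriter.getI64IntegerAttr(value));
}
```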
955

rewriter.getIntegerType(64) for consistency ?

997

Can we rename this elementTypeAsLLVMConstant ?
I find it confusing to have a type of type Value; this is purely because of LLVM encoding details.

mlir/lib/Dialect/NVGPU/IR/NVGPUDialect.cpp
366

spurious duplication of C++

mlir/lib/ExecutionEngine/CudaRuntimeWrappers.cpp
288

stale comment as we don't use managed memory anymore in this PR?

mlir/test/Conversion/NVGPUToNVVM/nvgpu-to-nvvm.mlir
650

nit: newline

This revision is now accepted and ready to land.Jul 20 2023, 11:47 PM
guraypp updated this revision to Diff 542814.Jul 21 2023, 2:13 AM

address comments

guraypp edited the summary of this revision. (Show Details)Jul 21 2023, 2:28 AM
guraypp marked 6 inline comments as done.
This revision was landed with ongoing or failed builds.Jul 21 2023, 2:33 AM
This revision was automatically updated to reflect the committed changes.