This is an archive of the discontinued LLVM Phabricator instance.

[mlir][gpu] Add `gpu.alloc_managed`
Needs Review · Public

Authored by guraypp on Jul 10 2023, 6:08 AM.

Details

Summary

This work adds a new op, gpu.alloc_managed, that allocates a memory region whose pointer is visible to both the GPU and the CPU. It is similar to memref.alloc; however, the data is migrated automatically between GPU and CPU by the driver, via page faults or underlying software.

gpu.alloc_managed works in a synchronous fashion, not asynchronously. Therefore, it cannot be mapped to the existing gpu.alloc, which can be executed asynchronously. Note that gpu.alloc has a host_shared option, which is not used anywhere; I am not sure what the intent was there.
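A minimal sketch of how the proposed op might be used, assuming its assembly format mirrors the synchronous form of gpu.alloc (the exact syntax is whatever the diff defines):

  func.func @use_managed() {
    // Synchronous allocation: no async token is produced. The buffer
    // is visible to both host and device; the driver migrates the
    // data on demand.
    %buf = gpu.alloc_managed () : memref<8x64xf32>
    // ... read/write %buf from host code or GPU kernels ...
    // Whether gpu.dealloc handles managed buffers is an assumption here.
    gpu.dealloc %buf : memref<8x64xf32>
    return
  }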

Diff Detail

Event Timeline

guraypp created this revision. Jul 10 2023, 6:08 AM
Herald added a reviewer: dcaballe.
Herald added a project: Restricted Project.
guraypp requested review of this revision. Jul 10 2023, 6:08 AM

This op works in a synchronous fashion; therefore, it cannot be mapped to the existing gpu.alloc, which is executed asynchronously. gpu.alloc has host_shared; I guess the intent was to use managed memory. However, it is not used anywhere, so this work removes it.

We are using host_shared downstream. I personally don't have a preference between a flag and a separate op (though a separate op will require more copy-paste in the .td file), but the async argument is not relevant here, as the existing gpu.alloc cannot work asynchronously either (the stream argument is just ignored).

We are using host_shared downstream.

Thanks for the review! I decided to delete it because I didn't see any use of host_shared. Are you using it to allocate managed memory or unified memory?

I personally don't have a preference between a flag and a separate op (though a separate op will require more copy-paste in the .td file), but the async argument is not relevant here, as the existing gpu.alloc cannot work asynchronously either (the stream argument is just ignored).

async can be useful for NVIDIA targets, where one can call `cuMemAllocAsync`. I can put up another PR that uses the stream, at least for NVIDIA GPUs.
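For reference, this is the existing async form of gpu.alloc from the op documentation, which such a PR could lower to `cuMemAllocAsync` on the stream backing the token:

  %memref, %token = gpu.alloc async [%dep] (%width) : memref<64x?xf32, 1>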

I created a new op because gpu.alloc implements GPU_AsyncOpInterface, and managed memory allocation is not an asynchronous operation. I don't know of any system that does this asynchronously.

We are using sycl::malloc_shared for host_shared allocations (and sycl::malloc_device for device ones); SYCL doesn't support async allocations, but doing them synchronously will still produce the expected observable results, so having dependencies is harmless. IMO, the proposed separation is too CUDA-specific (as is the managed name), and the GPU dialect is intended to be a more cross-API dialect. If some implementation doesn't support a specific combination of flags and async, it can just safely fall back to a synchronous alloc.

We are using sycl::malloc_shared for host_shared allocations (and sycl::malloc_device for device ones); SYCL doesn't support async allocations, but doing them synchronously will still produce the expected observable results, so having dependencies is harmless.

Sounds like SYCL needs asynchronous allocation.

IMO, the proposed separation is too CUDA-specific (as is the managed name), and the GPU dialect is intended to be a more cross-API dialect. If some implementation doesn't support a specific combination of flags and async, it can just safely fall back to a synchronous alloc.

It is not CUDA-specific. I've implemented it in HIP, which has the same name and implementation.

managed means that the data is managed by the driver or runtime. SYCL's malloc_shared does not say anything about how the data is shared.

I would wait for other people's opinions, but naming aside, I still don't see much reason for this change: you are encoding a specific API restriction (I believe a dedicated enough person should be able to implement async allocs even for host_shared/managed memory, something along the lines of calling the allocation from a different thread), and you are copy-pasting the entire op (you forgot to copy-paste the verifier, and your copy-pasted canonicalization is missing tests and won't work).

guraypp updated this revision to Diff 540919. Jul 17 2023, 2:39 AM

rebase
add verifier
dont remove host_shared

guraypp edited the summary of this revision. Jul 17 2023, 3:01 AM
guraypp updated this revision to Diff 540970. Jul 17 2023, 5:32 AM
guraypp edited the summary of this revision.

fix issues in verifier

@Hardcode84 does this look fine to you? I added a verifier and put host_shared back.

I would wait for other people's opinions, but naming aside, I still don't see much reason for this change,

We need a new op. As I mentioned, gpu.alloc is asynchronous, while gpu.alloc_managed cannot allocate asynchronously.

One alternative solution could be:

  1. Make the existing gpu.alloc a synchronous op. Use host_shared for managed memory.
  2. Add a new op, gpu.alloc_async, for asynchronous allocation.

Either way, we need a new op. A rough sketch of option 2 follows.
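In that alternative, gpu.alloc_async is hypothetical and not part of this patch; the IR could look roughly like this:

  // 1. gpu.alloc becomes synchronous; host_shared covers managed memory.
  %shared = gpu.alloc host_shared () : memref<8x64xf32>
  // 2. A hypothetical gpu.alloc_async carries the async token instead.
  %m, %t = gpu.alloc_async [%dep] () : memref<8x64xf32>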

As I said, asynchronous allocation is too tied to a specific runtime; there are other runtimes that don't support async allocations at all. For SYCL we do the allocation synchronously regardless of the async tokens passed (which may be suboptimal, but will still produce a correct result); you can do the same (ignore the stream if host_shared is passed). IMO, adding a new op in addition to host_shared just bloats the code and the API (btw, your copy-pasted canonicalization is still lacking tests and won't actually work).

Alternatively, you can just disallow having both host_shared and async tokens on the alloc in the verifier.
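Under that rule, the verifier would reject IR like the following (illustrative only; the combination follows the op's documented syntax):

  // Invalid under the proposed restriction: host_shared together
  // with async tokens.
  %m, %t = gpu.alloc async [%dep] host_shared () : memref<8x64xf32>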

Matt added a subscriber: Matt. Jul 19 2023, 12:56 PM
guraypp added a comment (edited). Jul 19 2023, 1:05 PM

As I said, asynchronous allocation is too tied to a specific runtime; there are other runtimes that don't support async allocations at all.

This isn't important. Allocation can be done asynchronously. The op is designed to be async. I hope we are on the same page about that. CUDA is just an example.

for SYCL we do the allocation synchronously regardless of the async tokens passed

It sounds incorrect to me.

(which may be suboptimal, but will still produce a correct result)

Running a parallel program sequentially would also produce a correct result.

you can do the same (ignore the stream if host_shared is passed). IMO,

Are you proposing to continue down the incorrect path?

adding a new op in addition to host_shared just bloats the code and the API

The PR deleted host_shared initially. I put it back because you asked. I think we should delete it!

(btw, your copy-pasted canonicalization is still lacking tests and won't actually work).

This is the only productive comment from your side. I will ask you to elaborate: why will it not work?

The Op is designed to be async.

This op can work either sync or async, depending on whether it has async tokens or not; quoting from the doc:

If the `async` keyword is present, the op is executed asynchronously (i.e.
it does not block until the execution has finished on the device). In
that case, it also returns a !gpu.async.token.

host_shared without tokens will give you the exact semantics you want.
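That is, something like this, which blocks the host until the allocation completes:

  // Synchronous form: no async keyword, no token returned.
  %buf = gpu.alloc host_shared () : memref<8x64xf32>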

It sounds incorrect to me.

There is nothing we can do there (and the same will apply to OpenCL).

you can do the same (ignore stream if host_shared is passed). IMO,

Are you proposing to continue incorrect path?

I don't see a big issue here (especially considering no one has bothered to actually change the CUDA impl to async in years).

Anyway, I'm not a GPU code owner, and I want to hear other people's opinions (@csigg and @bondhugula reviewed the original host_shared flag).

@guraypp

Also, do you have any specific use case that benefits from async allocs? It would be interesting to look at it.

@guraypp

Also, do you have any specific use case that benefits from async allocs? It would be interesting to look at it.

Large memory allocations are expensive. One can imagine overlapping work on the host with async allocation of device memory.
This work is not related to async allocs, though, so I don't have one.