This is an archive of the discontinued LLVM Phabricator instance.

[mlir][nvgpu] Add `tma.create.descriptor` to create tensor map descriptor
ClosedPublic

Authored by guraypp on Jul 19 2023, 1:24 AM.

Details

Summary

The Op creates a tensor map descriptor object representing a tiled memory region. The descriptor is used by Tensor Memory Access (TMA). The `tensor` is the source tensor to be tiled. The `boxDimensions` attribute gives the size of the tiled memory region in each dimension.

The pattern here lowers `tma.create.descriptor` to a runtime function call that eventually calls the CUDA Driver's `cuTensorMapEncodeTiled`. For more information, see:
https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TENSOR__MEMORY.html
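For illustration, such a wrapper boils down to a call like the following sketch. The wrapper name, rank, data type, and the interleave/swizzle/L2/fill choices are assumptions made for the example, not the exact code in this patch:

```cpp
#include <cuda.h>
#include <cstdint>

// Sketch of a runtime wrapper that encodes a 2-D tiled tensor map for TMA.
// The real wrapper in CudaRuntimeWrappers.cpp may differ in name and arguments;
// data type, rank, and the interleave/swizzle/L2/fill modes are fixed here for brevity.
extern "C" CUresult sketchEncodeTensorMap2D(CUtensorMap *map, void *globalAddress,
                                            uint64_t dim0, uint64_t dim1,
                                            uint64_t outerStrideBytes,
                                            uint32_t boxDim0, uint32_t boxDim1) {
  cuuint64_t globalDim[2] = {dim0, dim1};
  cuuint64_t globalStrides[1] = {outerStrideBytes}; // rank-1 strides; innermost dim contiguous
  cuuint32_t boxDim[2] = {boxDim0, boxDim1};        // size of the tiled region per dimension
  cuuint32_t elementStrides[2] = {1, 1};
  return cuTensorMapEncodeTiled(
      map, CU_TENSOR_MAP_DATA_TYPE_FLOAT32, /*tensorRank=*/2, globalAddress,
      globalDim, globalStrides, boxDim, elementStrides,
      CU_TENSOR_MAP_INTERLEAVE_NONE, CU_TENSOR_MAP_SWIZZLE_NONE,
      CU_TENSOR_MAP_L2_PROMOTION_NONE, CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);
}
```

The encoded descriptor then has to live somewhere the kernel can read it, which is what the discussion about managed versus explicitly copied memory below is about.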

Depends on D155453

Diff Detail

Event Timeline

guraypp created this revision.Jul 19 2023, 1:24 AM
Herald added a reviewer: dcaballe.
Herald added a project: Restricted Project.
guraypp requested review of this revision.Jul 19 2023, 1:24 AM

Generally looks good, thanks for adding this!

Qq on the dependency on gpu.alloc_managed.
It seems to me the managed alloc is hidden within the tma.create_descriptor op and you would only need a tma.destroy_descriptor to free the underlying memory?

Do you anticipate actually needing this gpu.alloc_managed op for other use cases?

Qq on the dependency on gpu.alloc_managed.
It seems to me the managed alloc is hidden within the tma.create_descriptor op

I hid it for the sake of simplicity. I can also call cuMemAlloc + cuMemcpy; I can do that to break the dependency.
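For reference, that driver-API alternative is roughly the following sketch (error handling omitted; `hostDesc` stands for an already-encoded CUtensorMap on the host):

```cpp
// Sketch: place the encoded descriptor in device memory without managed allocations.
CUdeviceptr devDesc;
cuMemAlloc(&devDesc, sizeof(CUtensorMap));              // plain device allocation
cuMemcpyHtoD(devDesc, &hostDesc, sizeof(CUtensorMap));  // copy the host-encoded descriptor
```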

and you would only need a tma.destroy_descriptor to free the underlying memory?

That's right, there is a need for a tma.destroy_descriptor Op with or without managed memory. Thanks for bringing that up!

Do you anticipate actually needing this gpu.alloc_managed op for other use cases?

Managed memory is easy to use, and I like it when writing IR by hand. It performs as well as a manual synchronous copy if the program copies data in at the beginning and copies it back at the end, because NVIDIA GPUs have page migration: the hardware keeps the pages resident and caches them in TLBs.
It can cause thrashing when there are copies between kernels. It's possible to alleviate this with cudaMemAdvise, and cudaMemPrefetchAsync can be used to reach the same performance as a manual asynchronous copy.
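For illustration, those two hints look roughly like this (a sketch against the CUDA runtime API; the device id, stream, and sizes are placeholders):

```cpp
#include <cuda_runtime.h>

// Sketch: reduce managed-memory thrashing between kernel launches.
void adviseAndPrefetch(void *managedPtr, size_t bytes, int device,
                       cudaStream_t stream) {
  // Prefer keeping the pages resident on the GPU that runs the kernels.
  cudaMemAdvise(managedPtr, bytes, cudaMemAdviseSetPreferredLocation, device);
  // Migrate the pages ahead of the launch, similar to a manual asynchronous copy.
  cudaMemPrefetchAsync(managedPtr, bytes, device, stream);
}
```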

guraypp updated this revision to Diff 542499.Jul 20 2023, 6:48 AM

break dependency on managed memory

nicolasvasilache added inline comments.
mlir/include/mlir/Conversion/GPUCommon/GPUCommonPass.h
53

Some minor doc comment plz.

mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp
953

seems this makeConst and the one in the following function could be factored out as a

static Value makeI64Const(...)
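A sketch of what that shared helper could look like against the MLIR C++ API (assuming the existing includes of NVGPUToNVVM.cpp; the final patch may spell it differently):

```cpp
// Sketch: shared helper for emitting i64 constants in the NVGPU-to-NVVM patterns.
static Value makeI64Const(ConversionPatternRewriter &rewriter, Location loc,
                          int64_t value) {
  return rewriter.create<LLVM::ConstantOp>(loc, rewriter.getIntegerType(64),
                                           rewriter.getI64IntegerAttr(value));
}
```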
955

rewriter.getIntegerType(64) for consistency ?

997

Can we rename this elementTypeAsLLVMConstant ?
I find it confusing to have a type of type Value; this is purely because of LLVM encoding details.

mlir/lib/Dialect/NVGPU/IR/NVGPUDialect.cpp
366

spurious duplication of C++

mlir/lib/ExecutionEngine/CudaRuntimeWrappers.cpp
288

stale comment as we don't use managed memory anymore in this PR?

mlir/test/Conversion/NVGPUToNVVM/nvgpu-to-nvvm.mlir
650

nit: newline

This revision is now accepted and ready to land.Jul 20 2023, 11:47 PM
guraypp updated this revision to Diff 542814.Jul 21 2023, 2:13 AM

address comments

guraypp edited the summary of this revision. (Show Details)Jul 21 2023, 2:28 AM
guraypp marked 6 inline comments as done.
This revision was landed with ongoing or failed builds.Jul 21 2023, 2:33 AM
This revision was automatically updated to reflect the committed changes.