As we confirm with the Nvidia people that it is fine to create the environment handle once in the program and use it in multiple streams, I create this revision to rework the environment initialization/destruction to mimic module load/unload mechanism without passing the environment handle around in the dialect.
This allows 1) handy reuse of CUDA environment handle in CUDARuntimeWrapper throughout the instance lifetime, and 2) removal of environment handle in the IR
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
mlir/include/mlir/Dialect/GPU/IR/GPUOps.td | ||
---|---|---|
1558 ↗ | (On Diff #535118) | I am even wondering if we can't just get rid of these two GPU ops altogether, and demand that the client (sparse compiler in this case) generates a proper cudaRTWrapper at the module start and end. That way, we can also document thread safety issue, i.e. single thread setup/breakdown |
mlir/lib/ExecutionEngine/CudaRuntimeWrappers.cpp | ||
109 | this feels very thread unsafe! I would expect something like a static initializer to take care of this. | |
318–319 | I think we should not have this create/destroy, but a simple startModule/endModule (with restrictions that they are called in certain ways) | |
mlir/test/Conversion/GPUCommon/lower-sparse-to-gpu-runtime-calls.mlir | ||
23 ↗ | (On Diff #535118) | it does not make sense to get a token on this anymore, as Peiming said above |
mlir/include/mlir/Dialect/GPU/IR/GPUOps.td | ||
---|---|---|
1558 ↗ | (On Diff #535118) | These comments make sense. I will address them. Thank you both! |
mlir/include/mlir/Dialect/GPU/IR/GPUOps.td | ||
---|---|---|
1558 ↗ | (On Diff #535118) | good catch! Now it is completely gone. And because we don't need to pass stream to the @mgpu call, we determine to use llvm.call to initialize and destroy environments, and completely remove the GPU dialect environment handle creation/destruction op |
mlir/lib/ExecutionEngine/CudaRuntimeWrappers.cpp | ||
---|---|---|
109 | I moved the initialization into the create env functions. How does it look? |
mlir/lib/ExecutionEngine/CudaRuntimeWrappers.cpp | ||
---|---|---|
83 | Remove all this scoped stuff and initializer bool all together in favor of (1) a single static handle handle then you can simply have Single handle shared between all cuSparse calls. The client is responsible void mgpuCreateSparseEnv() } assert(!handle); CUSPARSE_REPORT_IF_ERROR(cusparseCreate(&handle))); } extern "C" MLIR_CUDA_WRAPPERS_EXPORT void mgpuDestroySparseEnv() { assert(handle); CUSPARSE_REPORT_IF_ERROR(cusparseDestroy(handle)) handle = nullptr; } and then all methods have extern "C" MLIR_CUDA_WRAPPERS_EXPORT void mgpuSpMV(int32_t ma, void *a, void *x, void *y, int32_t ctp, void *buf, CUstream /*stream*/) { assert(handle) && "client did not call mgpuCreateSparseEnv()"; | |
mlir/test/Integration/Dialect/SparseTensor/GPU/CUDA/sm80-lt/sparse-matmul-2-4-lib.mlir | ||
34 ↗ | (On Diff #536352) | can we move this all the way up to main() just as illustration of what a sparse compiler should generate? |
mlir/test/Integration/Dialect/SparseTensor/GPU/CUDA/sm80-lt/sparse-matmul-2-4-lib.mlir | ||
---|---|---|
34 ↗ | (On Diff #536352) | Good point! I am working on it. |
mlir/include/mlir/Dialect/GPU/IR/GPUOps.td | ||
---|---|---|
1563 ↗ | (On Diff #536384) | note that we can now also remove the EnvHandle type completely from google3/third_party/llvm/llvm-project/mlir/include/mlir/Dialect/GPU/IR/GPUBase.td |
mlir/lib/Dialect/SparseTensor/Transforms/SparseGPUCodegen.cpp | ||
---|---|---|
63 ↗ | (On Diff #536384) | I believe codegen already provides createFuncCall(rewriter, loc, "foo", {}, {}, EmitCInterface::Off); for this? but, not needed per our convention |
486 ↗ | (On Diff #536384) | remove all this we assume client will do this |
645 ↗ | (On Diff #536384) | yeah, none of the env stuff in this file |
mlir/lib/ExecutionEngine/CudaRuntimeWrappers.cpp | ||
319 | document that scoped context is for cinit | |
320 | I would do assert(!cusparse_env) so we detect double calls in debug mode | |
539 | assert | |
mlir/test/Dialect/SparseTensor/GPU/gpu_matvec_lib.mlir | ||
46 ↗ | (On Diff #536384) | don't |
mlir/lib/Dialect/SparseTensor/Transforms/SparseGPUCodegen.cpp | ||
---|---|---|
39 ↗ | (On Diff #536417) | add empty line back |
mlir/test/Conversion/GPUCommon/lower-2to4-sparse-to-gpu-runtime-calls.mlir | ||
25 ↗ | (On Diff #536417) | I would completely remove the calls here |
mlir/test/Conversion/GPUCommon/lower-sparse-to-gpu-runtime-calls.mlir | ||
32 ↗ | (On Diff #536417) | same here, since we do not run this code, just leave out the create/destroy calls form the IR |
mlir/test/Dialect/GPU/ops.mlir | ||
332 ↗ | (On Diff #536417) | omit |
359 ↗ | (On Diff #536417) | omit |
Thanks for addressing all my comments so patiently. Looks great to me!
mlir/lib/Dialect/SparseTensor/Transforms/SparseGPUCodegen.cpp | ||
---|---|---|
39 ↗ | (On Diff #536417) | still there? |
mlir/lib/Dialect/SparseTensor/Transforms/SparseGPUCodegen.cpp | ||
---|---|---|
658 ↗ | (On Diff #536446) | This should not have been removed. Sending out a fix. |
Remove all this scoped stuff and initializer bool all together in favor of
(1) a single static handle handle
(2) create/destroy methods
then you can simply have
Single handle shared between all cuSparse calls. The client is responsible
for calling mgpuCreateSparseEnv() and d mgpuDestroySparseEnv()
// on entering and exiting the module containing sparsified GPU code.
static cusparseHandle_t env = nullptr;
void mgpuCreateSparseEnv() }
}
extern "C" MLIR_CUDA_WRAPPERS_EXPORT void mgpuDestroySparseEnv() {
}
and then all methods have
extern "C" MLIR_CUDA_WRAPPERS_EXPORT void mgpuSpMV(int32_t ma, void *a, void *x,