This is an archive of the discontinued LLVM Phabricator instance.

[mlir][sparse][gpu] rework CUDA sparse libs environment handle
ClosedPublic

Authored by K-Wu on Jun 16 2023, 2:38 PM.

Details

Reviewers

ftynse
aartbik
bondhugula
ThomasRaoux
dcaballe
Peiming
wrengr
bixia
yinying-lisa-li
nicolasvasilache
herhut

Commits

rGbe2dd22b8f47: [mlir][sparse][gpu] reuse CUDA environment handle throughout instance lifetime

Summary

Since we confirmed with the NVIDIA folks that it is fine to create the environment handle once in the program and use it in multiple streams, I created this revision to rework environment initialization/destruction so that it mimics the module load/unload mechanism, without passing the environment handle around in the dialect.
This allows (1) convenient reuse of the CUDA environment handle in CudaRuntimeWrappers throughout the instance lifetime, and (2) removal of the environment handle from the IR.
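
To make the summary concrete, here is a minimal sketch of the two styles it contrasts. It is not the code from this revision: `FakeSparseHandle`, `createEnvOld`, `spmvOld`, `destroyEnvOld`, `createEnv`, `spmv`, `destroyEnv`, and `sparseEnv` are hypothetical names, and the fake handle stands in for `cusparseHandle_t` so that the sketch builds without CUDA.

```cpp
#include <cassert>

// Hypothetical stand-in for cusparseHandle_t so the sketch builds without CUDA.
struct FakeSparseHandle {
  int uses = 0;
};

// Before: the handle is an opaque pointer that the generated IR must carry
// from the create call to every library call and finally to destroy.
void *createEnvOld() { return new FakeSparseHandle(); }
int spmvOld(void *h) { return ++static_cast<FakeSparseHandle *>(h)->uses; }
void destroyEnvOld(void *h) { delete static_cast<FakeSparseHandle *>(h); }

// After: the runtime wrapper owns one handle for the whole instance lifetime,
// so the library calls take no handle argument at all.
static FakeSparseHandle *sparseEnv = nullptr;
void createEnv() { sparseEnv = new FakeSparseHandle(); }
int spmv(int /*streamId*/) { return ++sparseEnv->uses; }
void destroyEnv() {
  delete sparseEnv;
  sparseEnv = nullptr;
}
```

In the second style, calls issued for different streams all reach the same handle, which is the reuse the summary describes. The stream argument is ignored here, as it is in the wrappers shown in the diff below.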

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

K-Wu created this revision.Jun 16 2023, 2:38 PM

Herald added a project: Restricted Project.Jun 16 2023, 2:38 PM

Herald added subscribers: bviyer, Moerafaat, bzcheeseman and 22 others.

working

reformat

working runtime

Harbormaster completed remote builds in B239543: Diff 532312.Jun 16 2023, 5:46 PM

rebase origin/main

Harbormaster completed remote builds in B241539: Diff 535051.Jun 27 2023, 12:19 PM

rm handles in the GPU dialect

Herald added a reviewer: ftynse.Jun 27 2023, 1:31 PM

Herald added a reviewer: aartbik.

Herald added a reviewer: bondhugula.

Herald added a reviewer: ThomasRaoux.

Herald added a reviewer: dcaballe.

Herald added subscribers: gysit, Dinistro, hanchung and 5 others.

K-Wu retitled this revision from [mlir][sparse][gpu] reuse CUDA environment handle throughout instance lifetime to [mlir][sparse][gpu] rework CUDA environment handle throughout instance lifetime.Jun 27 2023, 1:34 PM

K-Wu edited the summary of this revision.

rm useless None flag

K-Wu edited the summary of this revision.Jun 27 2023, 1:37 PM

K-Wu added reviewers: Peiming, wrengr, bixia, yinying-lisa-li.

K-Wu published this revision for review.Jun 27 2023, 1:39 PM

K-Wu retitled this revision from [mlir][sparse][gpu] rework CUDA environment handle throughout instance lifetime to [mlir][sparse][gpu] rework CUDA sparse libs environment handle.

Herald added a reviewer: nicolasvasilache.Jun 27 2023, 1:39 PM

Herald added a reviewer: herhut.

Herald added a project: Restricted Project.

Herald added subscribers: stephenneuendorffer, nicolasvasilache.

Peiming added inline comments.Jun 27 2023, 1:55 PM

mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
1558 ↗	(On Diff #535118)	Then, this is no longer an Async operation right?
1587 ↗	(On Diff #535118)	ditto

Harbormaster completed remote builds in B241591: Diff 535118.Jun 27 2023, 3:26 PM

aartbik added inline comments.Jun 27 2023, 4:37 PM

mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
1558 ↗	(On Diff #535118)	I am even wondering if we can't just get rid of these two GPU ops altogether, and demand that the client (sparse compiler in this case) generates a proper cudaRTWrapper at the module start and end. That way, we can also document thread safety issue, i.e. single thread setup/breakdown
mlir/lib/ExecutionEngine/CudaRuntimeWrappers.cpp
109	this feels very thread unsafe! I would expect something like a static initializer to take care of this. Right now, the setup is done on an executing thread, so having more than one gets in trouble
318–319	I think we should not have this create/destroy, but a simple startModule/endModule (with restrictions that they are called in certain ways) and then use the handle below.
mlir/test/Conversion/GPUCommon/lower-sparse-to-gpu-runtime-calls.mlir
23 ↗	(On Diff #535118)	it does not make sense to get a token on this anymore, as Peiming said above but more importantly, I think we need a module level setup
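
The thread-safety concern above is about an unguarded `if (!initiated)` check that several executing threads could pass at once. As a hedged illustration of the "static initializer" idea mentioned in the comment, the sketch below uses a function-local static, whose first-use initialization is guaranteed to run once even under concurrent calls since C++11. `FakeSparseHandle`, `makeHandle`, `sparseEnv`, and `createCalls` are hypothetical names, and the fake handle stands in for `cusparseHandle_t`; this is not what the revision ended up doing.

```cpp
#include <cassert>

// Hypothetical stand-in for cusparseHandle_t so the sketch builds without CUDA.
struct FakeSparseHandle {
  int id;
};

// Counts how often the (fake) library initialization actually runs.
static int createCalls = 0;

// Stands in for the cusparseCreate call in the real wrapper.
static FakeSparseHandle makeHandle() {
  ++createCalls;
  return FakeSparseHandle{42};
}

// Returns the process-wide handle. The local static is initialized on first
// use; if several threads arrive at once, the language makes the others wait
// until that single initialization has finished.
FakeSparseHandle &sparseEnv() {
  static FakeSparseHandle env = makeHandle();
  return env;
}
```

This covers only creation. Destruction would happen at static teardown, which is one reason an explicit create/destroy pair called by the client at module entry and exit, as requested later in this review, can be easier to reason about.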

K-Wu added inline comments.Jun 27 2023, 4:50 PM

mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
1558 ↗	(On Diff #535118)	These comments make sense. I will address them. Thank you both!

rebase origin/main

Harbormaster completed remote builds in B241928: Diff 535568.Jun 28 2023, 6:05 PM

rm create/destroy sparse env from gpu dialect

clean up

K-Wu marked 5 inline comments as done.Jun 30 2023, 11:49 AM

K-Wu added inline comments.

mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
1558 ↗	(On Diff #535118)	Good catch! It is now completely gone. Because we no longer need to pass a stream to the @mgpu calls, we decided to use llvm.call to initialize and destroy the environments, and completely removed the environment handle creation/destruction ops from the GPU dialect.

addressing thread safety hopefully

K-Wu added inline comments.Jun 30 2023, 11:53 AM

mlir/lib/ExecutionEngine/CudaRuntimeWrappers.cpp
109	I moved the initialization into the create env functions. How does it look?

fix compile error

fix test error

aartbik added inline comments.Jun 30 2023, 12:17 PM

mlir/lib/ExecutionEngine/CudaRuntimeWrappers.cpp
83	Remove all this scoped stuff and the initializer bool altogether in favor of (1) a single static handle and (2) create/destroy methods. Then you can simply have:

	// Single handle shared between all cuSparse calls. The client is
	// responsible for calling mgpuCreateSparseEnv() and mgpuDestroySparseEnv()
	// on entering and exiting the module containing sparsified GPU code.
	static cusparseHandle_t handle = nullptr;

	void mgpuCreateSparseEnv() {
	  assert(!handle);
	  CUSPARSE_REPORT_IF_ERROR(cusparseCreate(&handle));
	}

	extern "C" MLIR_CUDA_WRAPPERS_EXPORT void mgpuDestroySparseEnv() {
	  assert(handle);
	  CUSPARSE_REPORT_IF_ERROR(cusparseDestroy(handle))
	  handle = nullptr;
	}

and then all methods have:

	extern "C" MLIR_CUDA_WRAPPERS_EXPORT void mgpuSpMV(int32_t ma, void *a, void *x, void *y, int32_t ctp, void *buf, CUstream /*stream*/) {
	  assert(handle && "client did not call mgpuCreateSparseEnv()");
mlir/test/Integration/Dialect/SparseTensor/GPU/CUDA/sm80-lt/sparse-matmul-2-4-lib.mlir
34 ↗	(On Diff #536352)	can we move this all the way up to main() just as illustration of what a sparse compiler should generate?
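
The contract proposed above (the client creates the environment on entering the module and destroys it on exiting, with asserts catching misuse) can be sketched as follows. This is an illustration under stated assumptions, not the landed code: `FakeSparseHandle` stands in for `cusparseHandle_t`, and `createSparseEnv`, `destroySparseEnv`, `sparseEnvIsLive`, `SparseEnvScope`, and `runModule` are hypothetical names. The scope guard is an addition of this sketch to show the pairing; the review itself only asks for the two calls around `main()`.

```cpp
#include <cassert>

// Hypothetical stand-in for cusparseHandle_t so the sketch builds without CUDA.
struct FakeSparseHandle {};

// Single handle shared by all library calls in this sketch.
static FakeSparseHandle *env = nullptr;

void createSparseEnv() {
  // Detects a double create in debug builds.
  assert(!env && "createSparseEnv() called twice");
  env = new FakeSparseHandle();
}

void destroySparseEnv() {
  // Detects a destroy without a matching create in debug builds.
  assert(env && "destroySparseEnv() without createSparseEnv()");
  delete env;
  env = nullptr;
}

bool sparseEnvIsLive() { return env != nullptr; }

// Hypothetical scope guard that pairs create and destroy automatically.
class SparseEnvScope {
public:
  SparseEnvScope() { createSparseEnv(); }
  ~SparseEnvScope() { destroySparseEnv(); }
  SparseEnvScope(const SparseEnvScope &) = delete;
  SparseEnvScope &operator=(const SparseEnvScope &) = delete;
};

// Stand-in for a main() that contains sparsified GPU code: the environment
// exists exactly for the duration of the call.
bool runModule() {
  SparseEnvScope scope;
  return sparseEnvIsLive();
}
```

Calling `runModule()` leaves no live environment behind, so a later module entry can create a fresh one without tripping the double-create assert.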

addressing comments

K-Wu marked an inline comment as done.Jun 30 2023, 12:43 PM

K-Wu added inline comments.

mlir/test/Integration/Dialect/SparseTensor/GPU/CUDA/sm80-lt/sparse-matmul-2-4-lib.mlir
34 ↗	(On Diff #536352)	Good point! I am working on it.

add runtime check

add todo

addressing comments

K-Wu added inline comments.Jun 30 2023, 12:53 PM

mlir/lib/Dialect/SparseTensor/Transforms/SparseGPUCodegen.cpp
533 ↗	(On Diff #536380)	I also noted TODO here
mlir/lib/ExecutionEngine/CudaRuntimeWrappers.cpp
83	Addressed. Let me know your thoughts!

add some doc

aartbik added inline comments.Jun 30 2023, 1:46 PM

mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
1563 ↗	(On Diff #536384)	note that we can now also remove the EnvHandle type completely from google3/third_party/llvm/llvm-project/mlir/include/mlir/Dialect/GPU/IR/GPUBase.td

removing unused sparse env handle type

K-Wu marked 2 inline comments as done.Jun 30 2023, 1:49 PM

fixing error

removing emitting llvm.calls in SparseGPUCodegen.cpp

aartbik added inline comments.Jun 30 2023, 1:53 PM

mlir/lib/Dialect/SparseTensor/Transforms/SparseGPUCodegen.cpp
63 ↗	(On Diff #536384)	I believe codegen already provides createFuncCall(rewriter, loc, "foo", {}, {}, EmitCInterface::Off); for this, but it is not needed per our convention.
486 ↗	(On Diff #536384)	Remove all this; we assume the client will do this.
645 ↗	(On Diff #536384)	yeah, none of the env stuff in this file
mlir/lib/ExecutionEngine/CudaRuntimeWrappers.cpp
319	document that scoped context is for cinit
320	I would do assert(!cusparse_env) so we detect double calls in debug mode
539	assert
mlir/test/Dialect/SparseTensor/GPU/gpu_matvec_lib.mlir
46 ↗	(On Diff #536384)	don't

fix test errors

K-Wu marked 7 inline comments as done.Jun 30 2023, 2:23 PM

addressing comments

aartbik added inline comments.Jun 30 2023, 2:23 PM

mlir/lib/Dialect/SparseTensor/Transforms/SparseGPUCodegen.cpp
39 ↗	(On Diff #536417)	add empty line back
mlir/test/Conversion/GPUCommon/lower-2to4-sparse-to-gpu-runtime-calls.mlir
25 ↗	(On Diff #536417)	I would completely remove the calls here
mlir/test/Conversion/GPUCommon/lower-sparse-to-gpu-runtime-calls.mlir
32 ↗	(On Diff #536417)	same here, since we do not run this code, just leave out the create/destroy calls from the IR
mlir/test/Dialect/GPU/ops.mlir
332 ↗	(On Diff #536417)	omit
359 ↗	(On Diff #536417)	omit

no init in destroy handle func now

addressing comments

K-Wu marked 3 inline comments as done.Jun 30 2023, 2:40 PM

Thanks for addressing all my comments so patiently. Looks great to me!

mlir/lib/Dialect/SparseTensor/Transforms/SparseGPUCodegen.cpp
39 ↗	(On Diff #536417)	still there?

This revision is now accepted and ready to land.Jun 30 2023, 2:50 PM

This revision was landed with ongoing or failed builds.Jun 30 2023, 2:53 PM

Closed by commit rGbe2dd22b8f47: [mlir][sparse][gpu] reuse CUDA environment handle throughout instance lifetime (authored by K-Wu).

This revision was automatically updated to reflect the committed changes.

K-Wu added a commit: rGbe2dd22b8f47: [mlir][sparse][gpu] reuse CUDA environment handle throughout instance lifetime.

Harbormaster completed remote builds in B242566: Diff 536441.Jun 30 2023, 4:07 PM

aartbik added inline comments.Jul 5 2023, 10:07 PM

mlir/lib/Dialect/SparseTensor/Transforms/SparseGPUCodegen.cpp
658 ↗	(On Diff #536446)	This should not have been removed. Sending out a fix.

aartbik mentioned this in D154564: [mlir][sparse][gpu] fix missing dealloc.Jul 5 2023, 10:21 PM

aartbik mentioned this in rG03125e6894f8: [mlir][sparse][gpu] fix missing dealloc.Jul 6 2023, 9:48 AM

Revision Contents

Path	Size
mlir/lib/ExecutionEngine/CudaRuntimeWrappers.cpp	132 lines

Diff 532312

mlir/lib/ExecutionEngine/CudaRuntimeWrappers.cpp

[73 lines hidden; enclosing context: ScopedContext() {]
}();		}();

CUDA_REPORT_IF_ERROR(cuCtxPushCurrent(context));		CUDA_REPORT_IF_ERROR(cuCtxPushCurrent(context));
}		}

~ScopedContext() { CUDA_REPORT_IF_ERROR(cuCtxPopCurrent(nullptr)); }		~ScopedContext() { CUDA_REPORT_IF_ERROR(cuCtxPopCurrent(nullptr)); }
};		};

		#ifdef MLIR_ENABLE_CUDA_CUSPARSE
		// Create the cusparse handles once for the duration of the instance
		class ScopedCuSparseHandleStorage {
		public:
		static cusparseHandle_t env;
		static bool initiated;
		ScopedCuSparseHandleStorage() {
		// Static reference to CUDA cuSparse environment handle
		if (!initiated) {
		CUSPARSE_REPORT_IF_ERROR(cusparseCreate(&env));
		initiated = true;
		}
		}

		~ScopedCuSparseHandleStorage() {}
		};

		cusparseHandle_t ScopedCuSparseHandleStorage::env = nullptr;
		bool ScopedCuSparseHandleStorage::initiated = false;

		#ifdef MLIR_ENABLE_CUDA_CUSPARSELT
		class ScopedCuSparseLtHandleStorage {
		public:
		static cusparseLtHandle_t env;
		static bool initiated;
		ScopedCuSparseLtHandleStorage() {
		// Static reference to CUDA cuSparseLt environment handle
		if (!initiated) {
		initiated = true;
		// note that cuSparseLt still uses cusparseStatus_t
		CUSPARSE_REPORT_IF_ERROR(cusparseLtInit(&env));
		}
		}

		~ScopedCuSparseLtHandleStorage() {}
		};

		cusparseLtHandle_t ScopedCuSparseLtHandleStorage::env;
		bool ScopedCuSparseLtHandleStorage::initiated = false;

		#endif // MLIR_ENABLE_CUDA_CUSPARSELT
		#endif // MLIR_ENABLE_CUDA_CUSPARSE

extern "C" MLIR_CUDA_WRAPPERS_EXPORT CUmodule mgpuModuleLoad(void *data) {		extern "C" MLIR_CUDA_WRAPPERS_EXPORT CUmodule mgpuModuleLoad(void *data) {
ScopedContext scopedContext;		ScopedContext scopedContext;
CUmodule module = nullptr;		CUmodule module = nullptr;
CUDA_REPORT_IF_ERROR(cuModuleLoadData(&module, data));		CUDA_REPORT_IF_ERROR(cuModuleLoadData(&module, data));
return module;		return module;
}		}

extern "C" MLIR_CUDA_WRAPPERS_EXPORT void mgpuModuleUnload(CUmodule module) {		extern "C" MLIR_CUDA_WRAPPERS_EXPORT void mgpuModuleUnload(CUmodule module) {
[177 lines hidden; enclosing context: if (dtp == CUDA_R_16BF || dtp == CUDA_C_16BF) { \]
(beta##p) = reinterpret_cast<void *>(&(beta##f)); \		(beta##p) = reinterpret_cast<void *>(&(beta##f)); \
} else { \		} else { \
(alpha##p) = reinterpret_cast<void *>(&(alpha##d)); \		(alpha##p) = reinterpret_cast<void *>(&(alpha##d)); \
(beta##p) = reinterpret_cast<void *>(&(beta##d)); \		(beta##p) = reinterpret_cast<void *>(&(beta##d)); \
}		}

extern "C" MLIR_CUDA_WRAPPERS_EXPORT void *		extern "C" MLIR_CUDA_WRAPPERS_EXPORT void *
mgpuCreateSparseEnv(CUstream /*stream*/) {		mgpuCreateSparseEnv(CUstream /*stream*/) {
cusparseHandle_t handle = nullptr;		ScopedCuSparseHandleStorage hstorage;
CUSPARSE_REPORT_IF_ERROR(cusparseCreate(&handle))		return reinterpret_cast<void *>(hstorage.env);
return reinterpret_cast<void *>(handle);
}		}

extern "C" MLIR_CUDA_WRAPPERS_EXPORT void		extern "C" MLIR_CUDA_WRAPPERS_EXPORT void
mgpuDestroySparseEnv(void *h, CUstream /*stream*/) {		mgpuDestroySparseEnv(void *h, CUstream /*stream*/) {
cusparseHandle_t handle = reinterpret_cast<cusparseHandle_t>(h);		ScopedCuSparseHandleStorage hstorage;
CUSPARSE_REPORT_IF_ERROR(cusparseDestroy(handle))		CUSPARSE_REPORT_IF_ERROR(cusparseDestroy(hstorage.env))
		hstorage.initiated = false;
}		}

extern "C" MLIR_CUDA_WRAPPERS_EXPORT void *		extern "C" MLIR_CUDA_WRAPPERS_EXPORT void *
mgpuCreateDnVec(intptr_t size, void *values, int32_t dtp, CUstream /*stream*/) {		mgpuCreateDnVec(intptr_t size, void *values, int32_t dtp, CUstream /*stream*/) {
cusparseDnVecDescr_t vec = nullptr;		cusparseDnVecDescr_t vec = nullptr;
auto dTp = static_cast<cudaDataType_t>(dtp);		auto dTp = static_cast<cudaDataType_t>(dtp);
CUSPARSE_REPORT_IF_ERROR(cusparseCreateDnVec(&vec, size, values, dTp))		CUSPARSE_REPORT_IF_ERROR(cusparseCreateDnVec(&vec, size, values, dTp))
return reinterpret_cast<void *>(vec);		return reinterpret_cast<void *>(vec);
[65 lines hidden]
mgpuDestroySpMat(void *m, CUstream /*stream*/) {		mgpuDestroySpMat(void *m, CUstream /*stream*/) {
cusparseSpMatDescr_t mat = reinterpret_cast<cusparseSpMatDescr_t>(m);		cusparseSpMatDescr_t mat = reinterpret_cast<cusparseSpMatDescr_t>(m);
CUSPARSE_REPORT_IF_ERROR(cusparseDestroySpMat(mat))		CUSPARSE_REPORT_IF_ERROR(cusparseDestroySpMat(mat))
}		}

extern "C" MLIR_CUDA_WRAPPERS_EXPORT intptr_t		extern "C" MLIR_CUDA_WRAPPERS_EXPORT intptr_t
mgpuSpMVBufferSize(void *h, int32_t ma, void *a, void *x, void *y, int32_t ctp,		mgpuSpMVBufferSize(void *h, int32_t ma, void *a, void *x, void *y, int32_t ctp,
CUstream /*stream*/) {		CUstream /*stream*/) {
cusparseHandle_t handle = reinterpret_cast<cusparseHandle_t>(h);		ScopedCuSparseHandleStorage hstorage;

cusparseOperation_t modeA = static_cast<cusparseOperation_t>(ma);		cusparseOperation_t modeA = static_cast<cusparseOperation_t>(ma);
cusparseSpMatDescr_t matA = reinterpret_cast<cusparseSpMatDescr_t>(a);		cusparseSpMatDescr_t matA = reinterpret_cast<cusparseSpMatDescr_t>(a);
cusparseDnVecDescr_t vecX = reinterpret_cast<cusparseDnVecDescr_t>(x);		cusparseDnVecDescr_t vecX = reinterpret_cast<cusparseDnVecDescr_t>(x);
cusparseDnVecDescr_t vecY = reinterpret_cast<cusparseDnVecDescr_t>(y);		cusparseDnVecDescr_t vecY = reinterpret_cast<cusparseDnVecDescr_t>(y);
cudaDataType_t cTp = static_cast<cudaDataType_t>(ctp);		cudaDataType_t cTp = static_cast<cudaDataType_t>(ctp);
ALPHABETA(cTp, alpha, beta)		ALPHABETA(cTp, alpha, beta)
size_t bufferSize = 0;		size_t bufferSize = 0;
CUSPARSE_REPORT_IF_ERROR(		CUSPARSE_REPORT_IF_ERROR(cusparseSpMV_bufferSize(
cusparseSpMV_bufferSize(handle, modeA, alphap, matA, vecX, betap, vecY,		hstorage.env, modeA, alphap, matA, vecX, betap, vecY, cTp,
cTp, CUSPARSE_SPMV_ALG_DEFAULT, &bufferSize))		CUSPARSE_SPMV_ALG_DEFAULT, &bufferSize))
return bufferSize == 0 ? 1 : bufferSize; // avoid zero-alloc		return bufferSize == 0 ? 1 : bufferSize; // avoid zero-alloc
}		}

extern "C" MLIR_CUDA_WRAPPERS_EXPORT void mgpuSpMV(void *h, int32_t ma, void *a,		extern "C" MLIR_CUDA_WRAPPERS_EXPORT void mgpuSpMV(void *h, int32_t ma, void *a,
void *x, void *y,		void *x, void *y,
int32_t ctp, void *buf,		int32_t ctp, void *buf,
CUstream /*stream*/) {		CUstream /*stream*/) {
cusparseHandle_t handle = reinterpret_cast<cusparseHandle_t>(h);
		ScopedCuSparseHandleStorage hstorage;
cusparseOperation_t modeA = static_cast<cusparseOperation_t>(ma);		cusparseOperation_t modeA = static_cast<cusparseOperation_t>(ma);
cusparseSpMatDescr_t matA = reinterpret_cast<cusparseSpMatDescr_t>(a);		cusparseSpMatDescr_t matA = reinterpret_cast<cusparseSpMatDescr_t>(a);
cusparseDnVecDescr_t vecX = reinterpret_cast<cusparseDnVecDescr_t>(x);		cusparseDnVecDescr_t vecX = reinterpret_cast<cusparseDnVecDescr_t>(x);
cusparseDnVecDescr_t vecY = reinterpret_cast<cusparseDnVecDescr_t>(y);		cusparseDnVecDescr_t vecY = reinterpret_cast<cusparseDnVecDescr_t>(y);
cudaDataType_t cTp = static_cast<cudaDataType_t>(ctp);		cudaDataType_t cTp = static_cast<cudaDataType_t>(ctp);
ALPHABETA(cTp, alpha, beta)		ALPHABETA(cTp, alpha, beta)
CUSPARSE_REPORT_IF_ERROR(cusparseSpMV(handle, modeA, alphap, matA, vecX,		CUSPARSE_REPORT_IF_ERROR(cusparseSpMV(hstorage.env, modeA, alphap, matA, vecX,
betap, vecY, cTp,		betap, vecY, cTp,
CUSPARSE_SPMV_ALG_DEFAULT, buf))		CUSPARSE_SPMV_ALG_DEFAULT, buf))
}		}

extern "C" MLIR_CUDA_WRAPPERS_EXPORT intptr_t		extern "C" MLIR_CUDA_WRAPPERS_EXPORT intptr_t
mgpuSpMMBufferSize(void *h, int32_t ma, int32_t mb, void *a, void *b, void *c,		mgpuSpMMBufferSize(void *h, int32_t ma, int32_t mb, void *a, void *b, void *c,
int32_t ctp, CUstream /*stream*/) {		int32_t ctp, CUstream /*stream*/) {
cusparseHandle_t handle = reinterpret_cast<cusparseHandle_t>(h);		ScopedCuSparseHandleStorage hstorage;
cusparseOperation_t modeA = static_cast<cusparseOperation_t>(ma);		cusparseOperation_t modeA = static_cast<cusparseOperation_t>(ma);
cusparseOperation_t modeB = static_cast<cusparseOperation_t>(mb);		cusparseOperation_t modeB = static_cast<cusparseOperation_t>(mb);
cusparseSpMatDescr_t matA = reinterpret_cast<cusparseSpMatDescr_t>(a);		cusparseSpMatDescr_t matA = reinterpret_cast<cusparseSpMatDescr_t>(a);
cusparseDnMatDescr_t matB = reinterpret_cast<cusparseDnMatDescr_t>(b);		cusparseDnMatDescr_t matB = reinterpret_cast<cusparseDnMatDescr_t>(b);
cusparseDnMatDescr_t matC = reinterpret_cast<cusparseDnMatDescr_t>(c);		cusparseDnMatDescr_t matC = reinterpret_cast<cusparseDnMatDescr_t>(c);
cudaDataType_t cTp = static_cast<cudaDataType_t>(ctp);		cudaDataType_t cTp = static_cast<cudaDataType_t>(ctp);
ALPHABETA(cTp, alpha, beta)		ALPHABETA(cTp, alpha, beta)
size_t bufferSize = 0;		size_t bufferSize = 0;
CUSPARSE_REPORT_IF_ERROR(cusparseSpMM_bufferSize(		CUSPARSE_REPORT_IF_ERROR(cusparseSpMM_bufferSize(
handle, modeA, modeB, alphap, matA, matB, betap, matC, cTp,		hstorage.env, modeA, modeB, alphap, matA, matB, betap, matC, cTp,
CUSPARSE_SPMM_ALG_DEFAULT, &bufferSize))		CUSPARSE_SPMM_ALG_DEFAULT, &bufferSize))
return bufferSize == 0 ? 1 : bufferSize; // avoid zero-alloc		return bufferSize == 0 ? 1 : bufferSize; // avoid zero-alloc
}		}

extern "C" MLIR_CUDA_WRAPPERS_EXPORT void		extern "C" MLIR_CUDA_WRAPPERS_EXPORT void
mgpuSpMM(void *h, int32_t ma, int32_t mb, void *a, void *b, void *c,		mgpuSpMM(void *h, int32_t ma, int32_t mb, void *a, void *b, void *c,
int32_t ctp, void *buf, CUstream /*stream*/) {		int32_t ctp, void *buf, CUstream /*stream*/) {
cusparseHandle_t handle = reinterpret_cast<cusparseHandle_t>(h);		ScopedCuSparseHandleStorage hstorage;
cusparseOperation_t modeA = static_cast<cusparseOperation_t>(ma);		cusparseOperation_t modeA = static_cast<cusparseOperation_t>(ma);
cusparseOperation_t modeB = static_cast<cusparseOperation_t>(mb);		cusparseOperation_t modeB = static_cast<cusparseOperation_t>(mb);
cusparseSpMatDescr_t matA = reinterpret_cast<cusparseSpMatDescr_t>(a);		cusparseSpMatDescr_t matA = reinterpret_cast<cusparseSpMatDescr_t>(a);
cusparseDnMatDescr_t matB = reinterpret_cast<cusparseDnMatDescr_t>(b);		cusparseDnMatDescr_t matB = reinterpret_cast<cusparseDnMatDescr_t>(b);
cusparseDnMatDescr_t matC = reinterpret_cast<cusparseDnMatDescr_t>(c);		cusparseDnMatDescr_t matC = reinterpret_cast<cusparseDnMatDescr_t>(c);
cudaDataType_t cTp = static_cast<cudaDataType_t>(ctp);		cudaDataType_t cTp = static_cast<cudaDataType_t>(ctp);
ALPHABETA(cTp, alpha, beta)		ALPHABETA(cTp, alpha, beta)
CUSPARSE_REPORT_IF_ERROR(cusparseSpMM(handle, modeA, modeB, alphap, matA,		CUSPARSE_REPORT_IF_ERROR(cusparseSpMM(hstorage.env, modeA, modeB, alphap,
matB, betap, matC, cTp,		matA, matB, betap, matC, cTp,
CUSPARSE_SPMM_ALG_DEFAULT, buf))		CUSPARSE_SPMM_ALG_DEFAULT, buf))
}		}

// TODO: add support to passing alpha and beta as arguments		// TODO: add support to passing alpha and beta as arguments
extern "C" MLIR_CUDA_WRAPPERS_EXPORT intptr_t		extern "C" MLIR_CUDA_WRAPPERS_EXPORT intptr_t
mgpuSDDMMBufferSize(void *h, int32_t ma, int32_t mb, void *a, void *b, void *c,		mgpuSDDMMBufferSize(void *h, int32_t ma, int32_t mb, void *a, void *b, void *c,
int32_t ctp, CUstream /*stream*/) {		int32_t ctp, CUstream /*stream*/) {
cusparseHandle_t handle = reinterpret_cast<cusparseHandle_t>(h);		ScopedCuSparseHandleStorage hstorage;
cusparseOperation_t modeA = static_cast<cusparseOperation_t>(ma);		cusparseOperation_t modeA = static_cast<cusparseOperation_t>(ma);
cusparseOperation_t modeB = static_cast<cusparseOperation_t>(mb);		cusparseOperation_t modeB = static_cast<cusparseOperation_t>(mb);
cusparseDnMatDescr_t matA = reinterpret_cast<cusparseDnMatDescr_t>(a);		cusparseDnMatDescr_t matA = reinterpret_cast<cusparseDnMatDescr_t>(a);
cusparseDnMatDescr_t matB = reinterpret_cast<cusparseDnMatDescr_t>(b);		cusparseDnMatDescr_t matB = reinterpret_cast<cusparseDnMatDescr_t>(b);
cusparseSpMatDescr_t matC = reinterpret_cast<cusparseSpMatDescr_t>(c);		cusparseSpMatDescr_t matC = reinterpret_cast<cusparseSpMatDescr_t>(c);
auto cTp = static_cast<cudaDataType_t>(ctp);		auto cTp = static_cast<cudaDataType_t>(ctp);
ALPHABETA(cTp, alpha, beta)		ALPHABETA(cTp, alpha, beta)
size_t bufferSize = 0;		size_t bufferSize = 0;
CUSPARSE_REPORT_IF_ERROR(cusparseSDDMM_bufferSize(		CUSPARSE_REPORT_IF_ERROR(cusparseSDDMM_bufferSize(
handle, modeA, modeB, alphap, matA, matB, betap, matC, cTp,		hstorage.env, modeA, modeB, alphap, matA, matB, betap, matC, cTp,
CUSPARSE_SDDMM_ALG_DEFAULT, &bufferSize))		CUSPARSE_SDDMM_ALG_DEFAULT, &bufferSize))
return bufferSize == 0 ? 1 : bufferSize; // avoid zero-alloc		return bufferSize == 0 ? 1 : bufferSize; // avoid zero-alloc
}		}

extern "C" MLIR_CUDA_WRAPPERS_EXPORT void		extern "C" MLIR_CUDA_WRAPPERS_EXPORT void
mgpuSDDMM(void *h, int32_t ma, int32_t mb, void *a, void *b, void *c,		mgpuSDDMM(void *h, int32_t ma, int32_t mb, void *a, void *b, void *c,
int32_t ctp, void *buf, CUstream /*stream*/) {		int32_t ctp, void *buf, CUstream /*stream*/) {
cusparseHandle_t handle = reinterpret_cast<cusparseHandle_t>(h);		ScopedCuSparseHandleStorage hstorage;
cusparseOperation_t modeA = static_cast<cusparseOperation_t>(ma);		cusparseOperation_t modeA = static_cast<cusparseOperation_t>(ma);
cusparseOperation_t modeB = static_cast<cusparseOperation_t>(mb);		cusparseOperation_t modeB = static_cast<cusparseOperation_t>(mb);
cusparseDnMatDescr_t matA = reinterpret_cast<cusparseDnMatDescr_t>(a);		cusparseDnMatDescr_t matA = reinterpret_cast<cusparseDnMatDescr_t>(a);
cusparseDnMatDescr_t matB = reinterpret_cast<cusparseDnMatDescr_t>(b);		cusparseDnMatDescr_t matB = reinterpret_cast<cusparseDnMatDescr_t>(b);
cusparseSpMatDescr_t matC = reinterpret_cast<cusparseSpMatDescr_t>(c);		cusparseSpMatDescr_t matC = reinterpret_cast<cusparseSpMatDescr_t>(c);
auto cTp = static_cast<cudaDataType_t>(ctp);		auto cTp = static_cast<cudaDataType_t>(ctp);
ALPHABETA(cTp, alpha, beta)		ALPHABETA(cTp, alpha, beta)
CUSPARSE_REPORT_IF_ERROR(cusparseSDDMM(handle, modeA, modeB, alphap, matA,		CUSPARSE_REPORT_IF_ERROR(cusparseSDDMM(hstorage.env, modeA, modeB, alphap,
matB, betap, matC, cTp,		matA, matB, betap, matC, cTp,
CUSPARSE_SDDMM_ALG_DEFAULT, buf))		CUSPARSE_SDDMM_ALG_DEFAULT, buf))
}		}

#ifdef MLIR_ENABLE_CUDA_CUSPARSELT		#ifdef MLIR_ENABLE_CUDA_CUSPARSELT

///		///
/// Wrapper methods for the cuSparseLt library.		/// Wrapper methods for the cuSparseLt library.
///		///
[15 lines hidden]
};		};

static_assert(sizeof(cusparseLtHandle_t) == 11024);		static_assert(sizeof(cusparseLtHandle_t) == 11024);
static_assert(sizeof(cusparseLtSpMatHandleAndData) == 44104);		static_assert(sizeof(cusparseLtSpMatHandleAndData) == 44104);
static_assert(sizeof(cusparseLtDnMatHandleAndData) == 11032);		static_assert(sizeof(cusparseLtDnMatHandleAndData) == 11032);

extern "C" MLIR_CUDA_WRAPPERS_EXPORT void		extern "C" MLIR_CUDA_WRAPPERS_EXPORT void
mgpuCreateSparseLtEnv(void *h, CUstream /*stream*/) {		mgpuCreateSparseLtEnv(void *h, CUstream /*stream*/) {
// note that cuSparseLt still uses cusparseStatus_t		ScopedCuSparseLtHandleStorage hstorage;
CUSPARSE_REPORT_IF_ERROR(
cusparseLtInit(reinterpret_cast<cusparseLtHandle_t *>(h)))
}		}

extern "C" MLIR_CUDA_WRAPPERS_EXPORT void		extern "C" MLIR_CUDA_WRAPPERS_EXPORT void
mgpuDestroySparseLtEnv(void *h, CUstream /*stream*/) {		mgpuDestroySparseLtEnv(void *h, CUstream /*stream*/) {
auto handle = reinterpret_cast<cusparseLtHandle_t *>(h);		ScopedCuSparseLtHandleStorage hstorage;
CUSPARSE_REPORT_IF_ERROR(cusparseLtDestroy(handle))		CUSPARSE_REPORT_IF_ERROR(cusparseLtDestroy(&(hstorage.env)))
}		}

extern "C" MLIR_CUDA_WRAPPERS_EXPORT void		extern "C" MLIR_CUDA_WRAPPERS_EXPORT void
mgpuCreateCuSparseLtDnMat(void *dh, void *h, intptr_t rows, intptr_t cols,		mgpuCreateCuSparseLtDnMat(void *dh, void *h, intptr_t rows, intptr_t cols,
void *values, int32_t dtp, CUstream /*stream*/) {		void *values, int32_t dtp, CUstream /*stream*/) {
auto handle = reinterpret_cast<cusparseLtHandle_t *>(h);		ScopedCuSparseLtHandleStorage hstorage;

// CusparseLt expects the descriptors to be zero-initialized.		// CusparseLt expects the descriptors to be zero-initialized.
memset(dh, 0, sizeof(cusparseLtDnMatHandleAndData));		memset(dh, 0, sizeof(cusparseLtDnMatHandleAndData));
auto dnmat_handle = reinterpret_cast<cusparseLtDnMatHandleAndData *>(dh);		auto dnmat_handle = reinterpret_cast<cusparseLtDnMatHandleAndData *>(dh);
auto dTp = static_cast<cudaDataType_t>(dtp);		auto dTp = static_cast<cudaDataType_t>(dtp);
// assuming row-major when deciding lda		// assuming row-major when deciding lda
CUSPARSE_REPORT_IF_ERROR(cusparseLtDenseDescriptorInit(		CUSPARSE_REPORT_IF_ERROR(cusparseLtDenseDescriptorInit(
handle, &(dnmat_handle->mat), rows, cols, /*lda=*/cols,		&(hstorage.env), &(dnmat_handle->mat), rows, cols, /*lda=*/cols,
/*alignment=*/16, dTp, CUSPARSE_ORDER_ROW))		/*alignment=*/16, dTp, CUSPARSE_ORDER_ROW))
dnmat_handle->values = values;		dnmat_handle->values = values;
}		}

// This can be used to destroy both dense matrices and sparse matrices in		// This can be used to destroy both dense matrices and sparse matrices in
// cusparseLt		// cusparseLt
extern "C" MLIR_CUDA_WRAPPERS_EXPORT void		extern "C" MLIR_CUDA_WRAPPERS_EXPORT void
mgpuDestroyCuSparseLtSpMat(void *m, CUstream /*stream*/) {		mgpuDestroyCuSparseLtSpMat(void *m, CUstream /*stream*/) {
[9 lines hidden]

extern "C" MLIR_CUDA_WRAPPERS_EXPORT void		extern "C" MLIR_CUDA_WRAPPERS_EXPORT void
mgpuCusparseLtCreate2To4SpMat(void *sh, void *h, intptr_t rows, intptr_t cols,		mgpuCusparseLtCreate2To4SpMat(void *sh, void *h, intptr_t rows, intptr_t cols,
void *values, int32_t dtp, CUstream /*stream*/) {		void *values, int32_t dtp, CUstream /*stream*/) {
auto spmat_handle = reinterpret_cast<cusparseLtSpMatHandleAndData *>(sh);		auto spmat_handle = reinterpret_cast<cusparseLtSpMatHandleAndData *>(sh);
// CusparseLt expects the descriptors to be zero-initialized.		// CusparseLt expects the descriptors to be zero-initialized.
memset(spmat_handle, 0, sizeof(cusparseLtSpMatHandleAndData));		memset(spmat_handle, 0, sizeof(cusparseLtSpMatHandleAndData));
spmat_handle->values = values;		spmat_handle->values = values;
auto handle = reinterpret_cast<cusparseLtHandle_t *>(h);		ScopedCuSparseLtHandleStorage hstorage;
auto dTp = static_cast<cudaDataType_t>(dtp);		auto dTp = static_cast<cudaDataType_t>(dtp);
// assuming row-major when deciding lda		// assuming row-major when deciding lda
CUSPARSE_REPORT_IF_ERROR(cusparseLtStructuredDescriptorInit(		CUSPARSE_REPORT_IF_ERROR(cusparseLtStructuredDescriptorInit(
handle, &(spmat_handle->mat), rows, cols, /*ld=*/cols, /*alignment=*/16,		&(hstorage.env), &(spmat_handle->mat), rows, cols, /*ld=*/cols,
dTp, CUSPARSE_ORDER_ROW, CUSPARSELT_SPARSITY_50_PERCENT))		/*alignment=*/16, dTp, CUSPARSE_ORDER_ROW,
		CUSPARSELT_SPARSITY_50_PERCENT))
}		}

// Several things are being done in this stage, algorithm selection, planning,		// Several things are being done in this stage, algorithm selection, planning,
// and returning workspace and compressed matrices data buffer sizes.		// and returning workspace and compressed matrices data buffer sizes.
extern "C" MLIR_CUDA_WRAPPERS_EXPORT void		extern "C" MLIR_CUDA_WRAPPERS_EXPORT void
mgpuCuSparseLtSpMMBufferSize(void *bs, void *h, int32_t ma, int32_t mb, void *a,		mgpuCuSparseLtSpMMBufferSize(void *bs, void *h, int32_t ma, int32_t mb, void *a,
void *b, void *c, int32_t ctp,		void *b, void *c, int32_t ctp,
CUstream /*stream*/) {		CUstream /*stream*/) {
// TODO: support more advanced settings, e.g., the input right operand is a		// TODO: support more advanced settings, e.g., the input right operand is a
// sparse matrix assuming matA is the sparse matrix		// sparse matrix assuming matA is the sparse matrix
auto handle = reinterpret_cast<cusparseLtHandle_t *>(h);		ScopedCuSparseLtHandleStorage hstorage;
auto matA = reinterpret_cast<cusparseLtSpMatHandleAndData *>(a);		auto matA = reinterpret_cast<cusparseLtSpMatHandleAndData *>(a);
auto matB = reinterpret_cast<cusparseLtDnMatHandleAndData *>(b);		auto matB = reinterpret_cast<cusparseLtDnMatHandleAndData *>(b);
auto matC = reinterpret_cast<cusparseLtDnMatHandleAndData *>(c);		auto matC = reinterpret_cast<cusparseLtDnMatHandleAndData *>(c);
auto workspace_size = reinterpret_cast<size_t *>(bs);		auto workspace_size = reinterpret_cast<size_t *>(bs);
auto compressed_size = &(reinterpret_cast<size_t *>(bs)[1]);		auto compressed_size = &(reinterpret_cast<size_t *>(bs)[1]);
auto compressed_buffer_size = &(reinterpret_cast<size_t *>(bs)[2]);		auto compressed_buffer_size = &(reinterpret_cast<size_t *>(bs)[2]);
auto cTp = static_cast<cusparseComputeType>(ctp);		auto cTp = static_cast<cusparseComputeType>(ctp);

cusparseOperation_t modeA = static_cast<cusparseOperation_t>(ma);		cusparseOperation_t modeA = static_cast<cusparseOperation_t>(ma);
cusparseOperation_t modeB = static_cast<cusparseOperation_t>(mb);		cusparseOperation_t modeB = static_cast<cusparseOperation_t>(mb);
CUSPARSE_REPORT_IF_ERROR(cusparseLtMatmulDescriptorInit(		CUSPARSE_REPORT_IF_ERROR(cusparseLtMatmulDescriptorInit(
handle, &(matA->matmul), modeA, modeB, &(matA->mat), &(matB->mat),		&(hstorage.env), &(matA->matmul), modeA, modeB, &(matA->mat),
&(matC->mat), &(matC->mat), cTp))		&(matB->mat), &(matC->mat), &(matC->mat), cTp))
CUSPARSE_REPORT_IF_ERROR(cusparseLtMatmulAlgSelectionInit(		CUSPARSE_REPORT_IF_ERROR(cusparseLtMatmulAlgSelectionInit(
handle, &(matA->alg_sel), &(matA->matmul), CUSPARSELT_MATMUL_ALG_DEFAULT))		&(hstorage.env), &(matA->alg_sel), &(matA->matmul),
		CUSPARSELT_MATMUL_ALG_DEFAULT))
int alg = 0;		int alg = 0;
CUSPARSE_REPORT_IF_ERROR(cusparseLtMatmulAlgSetAttribute(		CUSPARSE_REPORT_IF_ERROR(cusparseLtMatmulAlgSetAttribute(
handle, &(matA->alg_sel), CUSPARSELT_MATMUL_ALG_CONFIG_ID, &alg,		&(hstorage.env), &(matA->alg_sel), CUSPARSELT_MATMUL_ALG_CONFIG_ID, &alg,
sizeof(alg)))		sizeof(alg)))

CUSPARSE_REPORT_IF_ERROR(cusparseLtMatmulPlanInit(		CUSPARSE_REPORT_IF_ERROR(cusparseLtMatmulPlanInit(
handle, &(matA->plan), &(matA->matmul), &(matA->alg_sel)))		&(hstorage.env), &(matA->plan), &(matA->matmul), &(matA->alg_sel)))

CUSPARSE_REPORT_IF_ERROR(		CUSPARSE_REPORT_IF_ERROR(cusparseLtMatmulGetWorkspace(
cusparseLtMatmulGetWorkspace(handle, &(matA->plan), workspace_size))		&(hstorage.env), &(matA->plan), workspace_size))
CUSPARSE_REPORT_IF_ERROR(cusparseLtSpMMACompressedSize(		CUSPARSE_REPORT_IF_ERROR(cusparseLtSpMMACompressedSize(
handle, &(matA->plan), compressed_size, compressed_buffer_size))		&(hstorage.env), &(matA->plan), compressed_size, compressed_buffer_size))

// avoid zero-alloc		// avoid zero-alloc
*workspace_size = (*workspace_size == 0 ? 1 : *workspace_size);		*workspace_size = (*workspace_size == 0 ? 1 : *workspace_size);
*compressed_size = (*compressed_size == 0 ? 1 : *compressed_size);		*compressed_size = (*compressed_size == 0 ? 1 : *compressed_size);
*compressed_buffer_size =		*compressed_buffer_size =
(*compressed_buffer_size == 0 ? 1 : *compressed_buffer_size);		(*compressed_buffer_size == 0 ? 1 : *compressed_buffer_size);
}		}

extern "C" MLIR_CUDA_WRAPPERS_EXPORT void		extern "C" MLIR_CUDA_WRAPPERS_EXPORT void
mgpuCuSparseLtSpMM(void *h, void *a, void *b, void *c, void *d_workspace,		mgpuCuSparseLtSpMM(void *h, void *a, void *b, void *c, void *d_workspace,
void *dA_compressed, void *dA_compressedBuffer,		void *dA_compressed, void *dA_compressedBuffer,
CUstream stream) {		CUstream stream) {
auto handle = reinterpret_cast<cusparseLtHandle_t *>(h);		ScopedCuSparseLtHandleStorage hstorage;
auto matA = reinterpret_cast<cusparseLtSpMatHandleAndData *>(a);		auto matA = reinterpret_cast<cusparseLtSpMatHandleAndData *>(a);
auto matB = reinterpret_cast<cusparseLtDnMatHandleAndData *>(b);		auto matB = reinterpret_cast<cusparseLtDnMatHandleAndData *>(b);
auto matC = reinterpret_cast<cusparseLtDnMatHandleAndData *>(c);		auto matC = reinterpret_cast<cusparseLtDnMatHandleAndData *>(c);

ALPHABETA(CUDA_R_32F, alpha, beta)		ALPHABETA(CUDA_R_32F, alpha, beta)
CUSPARSE_REPORT_IF_ERROR(		CUSPARSE_REPORT_IF_ERROR(
cusparseLtSpMMACompress(handle, &(matA->plan), (matA->values),		cusparseLtSpMMACompress(&(hstorage.env), &(matA->plan), (matA->values),
dA_compressed, dA_compressedBuffer, stream))		dA_compressed, dA_compressedBuffer, stream))

// TODO: add support to multi-stream execution		// TODO: add support to multi-stream execution
// Perform the matrix multiplication. D = A*B+C using C==D for now		// Perform the matrix multiplication. D = A*B+C using C==D for now
CUSPARSE_REPORT_IF_ERROR(		CUSPARSE_REPORT_IF_ERROR(
cusparseLtMatmul(handle, &(matA->plan), alphap, dA_compressed,		cusparseLtMatmul(&(hstorage.env), &(matA->plan), alphap, dA_compressed,
matB->values, betap, matC->values,		matB->values, betap, matC->values,
/*dD*/ matC->values, d_workspace, nullptr, 0))		/*dD*/ matC->values, d_workspace, nullptr, 0))

CUSPARSE_REPORT_IF_ERROR(cusparseLtMatDescriptorDestroy(&(matA->mat)))		CUSPARSE_REPORT_IF_ERROR(cusparseLtMatDescriptorDestroy(&(matA->mat)))
// destroy the plan associated with the sparse matrix		// destroy the plan associated with the sparse matrix
CUSPARSE_REPORT_IF_ERROR(cusparseLtMatmulPlanDestroy(&(matA->plan)))		CUSPARSE_REPORT_IF_ERROR(cusparseLtMatmulPlanDestroy(&(matA->plan)))
}		}

#endif // MLIR_ENABLE_CUDA_CUSPARSELT		#endif // MLIR_ENABLE_CUDA_CUSPARSELT
#endif // MLIR_ENABLE_CUDA_CUSPARSE		#endif // MLIR_ENABLE_CUDA_CUSPARSE
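
The "avoid zero-alloc" step in mgpuCuSparseLtSpMMBufferSize treats `bs` as three consecutive `size_t` slots (workspace size, compressed size, compressed buffer size) and bumps any zero to one. A minimal standalone sketch of that clamping, with no CUDA dependency, might look like the following. The helper names `clampToNonZero` and `clampBufferSizes` are hypothetical and do not appear in CudaRuntimeWrappers.cpp.

```cpp
#include <cstddef>

// Hypothetical helper (not part of the diff): make sure the size stored at
// `slot` is never zero, so a later allocation of that many bytes is non-empty.
inline void clampToNonZero(std::size_t *slot) {
  *slot = (*slot == 0 ? 1 : *slot);
}

// Hypothetical stand-in for the tail of the buffer-size wrapper. `bs` is
// assumed to point at three consecutive size_t slots:
//   [0] workspace size, [1] compressed size, [2] compressed buffer size.
inline void clampBufferSizes(void *bs) {
  auto *sizes = reinterpret_cast<std::size_t *>(bs);
  for (int i = 0; i < 3; ++i)
    clampToNonZero(&sizes[i]);
}
```

The point of the dereference on both sides of the comparison is that the size value is tested, not the pointer; comparing the pointer itself against zero would leave a zero size untouched.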

Revision Contents

Diff 532312

mlir/lib/ExecutionEngine/CudaRuntimeWrappers.cpp