This is an archive of the discontinued LLVM Phabricator instance.

[mlir][sparse][gpu] rework CUDA sparse libs environment handle
ClosedPublic

Authored by K-Wu on Jun 16 2023, 2:38 PM.


Details

Reviewers

ftynse
aartbik
bondhugula
ThomasRaoux
dcaballe
Peiming
wrengr
bixia
yinying-lisa-li
nicolasvasilache
herhut

Commits

rGbe2dd22b8f47: [mlir][sparse][gpu] reuse CUDA environment handle throughout instance lifetime

Summary

After confirming with NVIDIA that it is fine to create the environment handle once per program and use it across multiple streams, this revision reworks environment initialization and destruction to mimic the module load/unload mechanism, without passing the environment handle around in the dialect.
This allows (1) convenient reuse of the CUDA environment handle in CudaRuntimeWrappers throughout the instance lifetime, and (2) removal of the environment handle from the IR.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

K-Wu created this revision. Jun 16 2023, 2:38 PM

Herald added a project: Restricted Project. Jun 16 2023, 2:38 PM

Herald added subscribers: bviyer, Moerafaat, bzcheeseman and 22 others.

working

reformat

working runtime

Harbormaster completed remote builds in B239543: Diff 532312. Jun 16 2023, 5:46 PM

rebase origin/main

Harbormaster completed remote builds in B241539: Diff 535051. Jun 27 2023, 12:19 PM

rm handles in the GPU dialect

Herald added a reviewer: ftynse. Jun 27 2023, 1:31 PM

Herald added a reviewer: aartbik.

Herald added a reviewer: bondhugula.

Herald added a reviewer: ThomasRaoux.

Herald added a reviewer: dcaballe.

Herald added subscribers: gysit, Dinistro, hanchung and 5 others.

K-Wu retitled this revision from [mlir][sparse][gpu] reuse CUDA environment handle throughout instance lifetime to [mlir][sparse][gpu] rework CUDA environment handle throughout instance lifetime. Jun 27 2023, 1:34 PM

K-Wu edited the summary of this revision.

rm useless None flag

K-Wu edited the summary of this revision. Jun 27 2023, 1:37 PM

K-Wu added reviewers: Peiming, wrengr, bixia, yinying-lisa-li.

K-Wu published this revision for review. Jun 27 2023, 1:39 PM

K-Wu retitled this revision from [mlir][sparse][gpu] rework CUDA environment handle throughout instance lifetime to [mlir][sparse][gpu] rework CUDA sparse libs environment handle.

Herald added a reviewer: nicolasvasilache. Jun 27 2023, 1:39 PM

Herald added a reviewer: herhut.

Herald added a project: Restricted Project.

Herald added subscribers: stephenneuendorffer, nicolasvasilache.

Peiming added inline comments. Jun 27 2023, 1:55 PM

mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
1558	Then, this is no longer an Async operation right?
1587	ditto

Harbormaster completed remote builds in B241591: Diff 535118. Jun 27 2023, 3:26 PM

aartbik added inline comments. Jun 27 2023, 4:37 PM

mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
1558	I am even wondering if we can't just get rid of these two GPU ops altogether, and demand that the client (sparse compiler in this case) generates a proper cudaRTWrapper at the module start and end. That way, we can also document thread safety issue, i.e. single thread setup/breakdown
mlir/lib/ExecutionEngine/CudaRuntimeWrappers.cpp
109	this feels very thread unsafe! I would expect something like a static initializer to take care of this. Right now, the setup is done on an executing thread, so having more than one gets in trouble
318–319	I think we should not have this create/destroy, but a simple startModule/endModule (with restrictions that they are called in certain ways) and then use the handle below.
mlir/test/Conversion/GPUCommon/lower-sparse-to-gpu-runtime-calls.mlir
23	it does not make sense to get a token on this anymore, as Peiming said above but more importantly, I think we need a module level setup
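The static-initializer idea raised for CudaRuntimeWrappers.cpp line 109 can be illustrated with a small C++ sketch. This is not the code from this revision: `FakeSparseHandle`, `fakeCreate`, `creationCount` and `sharedEnv` are hypothetical stand-ins for the cuSPARSE handle and its creation call, chosen so the example compiles without CUDA. It relies on the C++11 guarantee that a function-local static is initialized exactly once, even if several threads reach it concurrently.

```cpp
#include <atomic>

// Hypothetical stand-in for a library handle such as cusparseHandle_t.
struct FakeSparseHandle {
  int id;
};

// Counts how many times the creation routine actually runs.
static std::atomic<int> creationCount{0};

// Hypothetical stand-in for a creation call such as cusparseCreate().
static FakeSparseHandle fakeCreate() {
  int n = ++creationCount;
  return FakeSparseHandle{n};
}

// Returns the process-wide handle. The function-local static is initialized
// on first use; since C++11 that initialization is thread-safe, so concurrent
// first callers cannot create the handle twice.
FakeSparseHandle &sharedEnv() {
  static FakeSparseHandle env = fakeCreate();
  return env;
}
```

The trade-off is that creation becomes implicit and teardown happens only at program exit, whereas the comments above lean toward explicit setup and breakdown calls issued by the client at module start and end.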

K-Wu added inline comments. Jun 27 2023, 4:50 PM

mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
1558	These comments make sense. I will address them. Thank you both!

rebase origin/main

Harbormaster completed remote builds in B241928: Diff 535568. Jun 28 2023, 6:05 PM

rm create/destroy sparse env from gpu dialect

clean up

K-Wu marked 5 inline comments as done. Jun 30 2023, 11:49 AM

K-Wu added inline comments.

mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
1558	Good catch! It is now completely gone. Because we no longer need to pass a stream to the @mgpu call, we decided to use llvm.call to initialize and destroy the environments, and to remove the GPU dialect environment handle creation/destruction ops entirely.

addressing thread safety hopefully

K-Wu added inline comments. Jun 30 2023, 11:53 AM

mlir/lib/ExecutionEngine/CudaRuntimeWrappers.cpp
109	I moved the initialization into the create env functions. How does it look?

fix compile error

fix test error

aartbik added inline comments. Jun 30 2023, 12:17 PM

mlir/lib/ExecutionEngine/CudaRuntimeWrappers.cpp
83	Remove all this scoped stuff and the initializer bool altogether in favor of (1) a single static handle and (2) create/destroy methods. Then you can simply have:
    // Single handle shared between all cuSparse calls. The client is
    // responsible for calling mgpuCreateSparseEnv() and mgpuDestroySparseEnv()
    // on entering and exiting the module containing sparsified GPU code.
    static cusparseHandle_t handle = nullptr;
    void mgpuCreateSparseEnv() {
      assert(!handle);
      CUSPARSE_REPORT_IF_ERROR(cusparseCreate(&handle));
    }
    extern "C" MLIR_CUDA_WRAPPERS_EXPORT void mgpuDestroySparseEnv() {
      assert(handle);
      CUSPARSE_REPORT_IF_ERROR(cusparseDestroy(handle));
      handle = nullptr;
    }
and then all methods have:
    extern "C" MLIR_CUDA_WRAPPERS_EXPORT void mgpuSpMV(int32_t ma, void *a, void *x, void *y, int32_t ctp, void *buf, CUstream /*stream*/) {
      assert(handle && "client did not call mgpuCreateSparseEnv()");
mlir/test/Integration/Dialect/SparseTensor/GPU/CUDA/sm80-lt/sparse-matmul-2-4-lib.mlir
31–35	can we move this all the way up to main() just as illustration of what a sparse compiler should generate?
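The single-static-handle scheme suggested above can be tried out without CUDA using the following C++ sketch. `MockContext`, `MockHandle`, `mockCreate`, `mockDestroy`, `createSparseEnv`, `destroySparseEnv`, `hasSparseEnv` and `spmvLike` are hypothetical names standing in for `cusparseHandle_t`, `cusparseCreate`, `cusparseDestroy` and the `mgpu*` wrappers. Only the control flow (assert on double create, assert on use before create, reset to null on destroy) is taken from the comment.

```cpp
#include <cassert>

// Hypothetical stand-in for cusparseHandle_t (an opaque pointer type).
struct MockContext {
  int opsIssued = 0;
};
using MockHandle = MockContext *;

// Hypothetical stand-ins for the library's create/destroy calls.
static void mockCreate(MockHandle *h) { *h = new MockContext(); }
static void mockDestroy(MockHandle h) { delete h; }

// Single handle shared by all wrapper calls in this translation unit.
static MockHandle handle = nullptr;

// The client calls this once on entering the module.
void createSparseEnv() {
  assert(!handle && "environment created twice");
  mockCreate(&handle);
}

// The client calls this once on leaving the module.
void destroySparseEnv() {
  assert(handle && "environment destroyed before creation");
  mockDestroy(handle);
  handle = nullptr;
}

// Reports whether the environment is currently live.
bool hasSparseEnv() { return handle != nullptr; }

// Every operation uses the shared handle instead of taking one as an argument.
// Returns the number of operations issued on the handle so far.
int spmvLike() {
  assert(handle && "client did not call createSparseEnv()");
  return ++handle->opsIssued;
}
```

Because the checks are plain `assert`s, they disappear in builds that define `NDEBUG`, so misuse is only detected in debug builds.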

addressing comments

K-Wu marked an inline comment as done. Jun 30 2023, 12:43 PM

K-Wu added inline comments.

mlir/test/Integration/Dialect/SparseTensor/GPU/CUDA/sm80-lt/sparse-matmul-2-4-lib.mlir
31–35	Good point! I am working on it.

add runtime check

add todo

addressing comments

K-Wu added inline comments. Jun 30 2023, 12:53 PM

mlir/lib/Dialect/SparseTensor/Transforms/SparseGPUCodegen.cpp
503	I also noted TODO here
mlir/lib/ExecutionEngine/CudaRuntimeWrappers.cpp
83	Addressed. Let me know your thoughts!

add some doc

aartbik added inline comments. Jun 30 2023, 1:46 PM

mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
1635	note that we can now also remove the EnvHandle type completely from google3/third_party/llvm/llvm-project/mlir/include/mlir/Dialect/GPU/IR/GPUBase.td

removing unused sparse env handle type

K-Wu marked 2 inline comments as done. Jun 30 2023, 1:49 PM

fixing error

removing emitting llvm.calls in SparseGPUCodegen.cpp

aartbik added inline comments. Jun 30 2023, 1:53 PM

mlir/lib/Dialect/SparseTensor/Transforms/SparseGPUCodegen.cpp
63	I believe codegen already provides createFuncCall(rewriter, loc, "foo", {}, {}, EmitCInterface::Off); for this? but, not needed per our convention
461	remove all this; we assume the client will do this
611	yeah, none of the env stuff in this file
mlir/lib/ExecutionEngine/CudaRuntimeWrappers.cpp
319	document that scoped context is for cinit
320	I would do assert(!cusparse_env) so we detect double calls in debug mode
539	assert
mlir/test/Dialect/SparseTensor/GPU/gpu_matvec_lib.mlir
48–50	don't

fix test errors

K-Wu marked 7 inline comments as done. Jun 30 2023, 2:23 PM

addressing comments

aartbik added inline comments. Jun 30 2023, 2:23 PM

mlir/lib/Dialect/SparseTensor/Transforms/SparseGPUCodegen.cpp
40	add empty line back
mlir/test/Conversion/GPUCommon/lower-2to4-sparse-to-gpu-runtime-calls.mlir
23–27	I would completely remove the calls here
mlir/test/Conversion/GPUCommon/lower-sparse-to-gpu-runtime-calls.mlir
31	same here, since we do not run this code, just leave out the create/destroy calls from the IR
mlir/test/Dialect/GPU/ops.mlir
330	omit
358	omit

no init in destroy handle func now

addressing comments

K-Wu marked 3 inline comments as done. Jun 30 2023, 2:40 PM

Thanks for addressing all my comments so patiently. Looks great to me!

mlir/lib/Dialect/SparseTensor/Transforms/SparseGPUCodegen.cpp
40	still there?

This revision is now accepted and ready to land. Jun 30 2023, 2:50 PM

This revision was landed with ongoing or failed builds. Jun 30 2023, 2:53 PM

Closed by commit rGbe2dd22b8f47: [mlir][sparse][gpu] reuse CUDA environment handle throughout instance lifetime (authored by K-Wu).

This revision was automatically updated to reflect the committed changes.

K-Wu added a commit: rGbe2dd22b8f47: [mlir][sparse][gpu] reuse CUDA environment handle throughout instance lifetime.

Harbormaster completed remote builds in B242566: Diff 536441. Jun 30 2023, 4:07 PM

aartbik added inline comments. Jul 5 2023, 10:07 PM

mlir/lib/Dialect/SparseTensor/Transforms/SparseGPUCodegen.cpp
658	This should not have been removed. Sending out a fix.

aartbik mentioned this in D154564: [mlir][sparse][gpu] fix missing dealloc. Jul 5 2023, 10:21 PM

aartbik mentioned this in rG03125e6894f8: [mlir][sparse][gpu] fix missing dealloc. Jul 6 2023, 9:48 AM

Revision Contents

Path	Size
mlir/include/mlir/Dialect/GPU/IR/GPUOps.td	97 lines
mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp	154 lines
mlir/lib/Dialect/SparseTensor/Transforms/SparseGPUCodegen.cpp	64 lines
mlir/lib/ExecutionEngine/CudaRuntimeWrappers.cpp	184 lines
mlir/test/Conversion/GPUCommon/lower-2to4-sparse-to-gpu-runtime-calls.mlir	12 lines
mlir/test/Conversion/GPUCommon/lower-sparse-to-gpu-runtime-calls.mlir	30 lines
mlir/test/Dialect/GPU/ops.mlir	20 lines
mlir/test/Dialect/GPU/sparse-roundtrip.mlir	60 lines
mlir/test/Dialect/SparseTensor/GPU/gpu_matmul_lib.mlir	12 lines
mlir/test/Dialect/SparseTensor/GPU/gpu_matvec_lib.mlir	12 lines
mlir/test/Dialect/SparseTensor/GPU/gpu_sampled_matmul_lib.mlir	12 lines
mlir/test/Integration/Dialect/SparseTensor/GPU/CUDA/sm80-lt/sparse-matmul-2-4-lib.mlir	14 lines

Diff 535568

mlir/include/mlir/Dialect/GPU/IR/GPUOps.td

[1,534 lines not shown]	let assemblyFormat = [{
$opType $args attr-dict `:` functional-type($args, $res)		$opType $args attr-dict `:` functional-type($args, $res)
}];		}];
}		}

//		//
// Operation on sparse matrices, called from the host		// Operation on sparse matrices, called from the host
// (currently lowers to cuSparse for CUDA only, no ROCM lowering).		// (currently lowers to cuSparse for CUDA only, no ROCM lowering).
//		//
		def GPU_RtLibMode: I32EnumAttr<"RtLibMode",
		"optional GPU runtime libraries to be enabled to support sparse ops",
		[
		I32EnumAttrCase<"CUSPARSE_AND_CUSPARSE_LT", 0>,
		I32EnumAttrCase<"CUSPARSE", 1>,
		]>{
		let genSpecializedAttr = 0;
		let cppNamespace = GPU_Dialect.cppNamespace;
		}

		def GPU_RtLibModeAttr : EnumAttr<GPU_Dialect, GPU_RtLibMode,
		"rtlib_mode">{
		let defaultValue = "RtLibMode::CUSPARSE_AND_CUSPARSE_LT";
		}

def GPU_CreateSparseEnvOp : GPU_Op<"create_sparse_env", [GPU_AsyncOpInterface]> {		def GPU_CreateSparseEnvOp : GPU_Op<"create_sparse_env", [GPU_AsyncOpInterface]> {
let summary = "Create sparse environment operation";		let summary = "Create sparse environment operation";
let description = [{		let description = [{
The `gpu.create_sparse_env` operation initializes a sparse environment.		The `gpu.initialize_sparse_env` operation initializes a sparse environment.
It must be executed prior to any other sparse operation. The operation		It must be executed prior to any other sparse operation. The operation
returns a handle to the new sparse environment.		takes in the RtLibMode argument to indicate whether cuSparse and cuSparseLt
		will be initialized, respectively.

If the `async` keyword is present, the op is executed asynchronously (i.e.		If the `async` keyword is present, the op is executed asynchronously (i.e.
it does not block until the execution has finished on the device). In		it does not block until the execution has finished on the device). In
that case, it returns a !gpu.async.token in addition to the environment.		that case, it returns a !gpu.async.token in addition to the environment.

Example:		Example:

```mlir		```mlir
%env, %token = gpu.create_sparse_env async [%dep]		%token = gpu.create_sparse_env async [%dep] %rtLibMode
```		```
}];		}];

let arguments = (ins Variadic<GPU_AsyncToken>:$asyncDependencies);		let arguments = (ins Variadic<GPU_AsyncToken>:$asyncDependencies,
let results = (outs Res<GPU_SparseEnvHandle>:$env,		Arg<GPU_RtLibModeAttr>:$rtLibMode);
Optional<GPU_AsyncToken>:$asyncToken);		let results = (outs Optional<GPU_AsyncToken>:$asyncToken);
let assemblyFormat = [{		let assemblyFormat = [{
custom<AsyncDependencies>(type($asyncToken), $asyncDependencies) attr-dict		custom<AsyncDependencies>(type($asyncToken), $asyncDependencies) $rtLibMode attr-dict
}];		}];
}		}

def GPU_DestroySparseEnvOp : GPU_Op<		def GPU_DestroySparseEnvOp : GPU_Op<
"destroy_sparse_env",		"destroy_sparse_env",
[GPU_AsyncOpInterface]> {		[GPU_AsyncOpInterface]> {
let summary = "Destroy sparse environment operation";		let summary = "Destroy sparse environment operation";
let description = [{		let description = [{
The `gpu.destroy_sparse_env` operation releases all resources of a sparse		The `gpu.destroy_sparse_env` operation releases all resources of a sparse
environment represented by a handle that was previously created by a		environment represented by the GPU_RtLibMode flag indicating whether cuSparse
`gpu.create_sparse_env` operation.		environment and cuSparseLt's will be destroyed, respectively.

If the `async` keyword is present, the op is executed asynchronously (i.e.		If the `async` keyword is present, the op is executed asynchronously (i.e.
it does not block until the execution has finished on the device). In		it does not block until the execution has finished on the device). In
that case, it returns a !gpu.async.token in addition to the environment.		that case, it returns a !gpu.async.token in addition to the environment.

Example:		Example:

```mlir		```mlir
%token = gpu.destroy_sparse_env async [%dep] %env		%token = gpu.destroy_sparse_env async [%dep] %rtLibMode
```		```
}];		}];

let arguments = (ins Variadic<GPU_AsyncToken>:$asyncDependencies,		let arguments = (ins Variadic<GPU_AsyncToken>:$asyncDependencies,
Arg<GPU_SparseEnvHandle>:$env);		Arg<GPU_RtLibModeAttr>:$rtLibMode);
let results = (outs Optional<GPU_AsyncToken>:$asyncToken);		let results = (outs Optional<GPU_AsyncToken>:$asyncToken);

let assemblyFormat = [{		let assemblyFormat = [{
custom<AsyncDependencies>(type($asyncToken), $asyncDependencies)		custom<AsyncDependencies>(type($asyncToken), $asyncDependencies)
$env attr-dict		$rtLibMode attr-dict
}];		}];
}		}

def GPU_CreateDnTensorOp : GPU_Op<"create_dn_tensor", [GPU_AsyncOpInterface, AttrSizedOperandSegments]> {		def GPU_CreateDnTensorOp : GPU_Op<"create_dn_tensor", [GPU_AsyncOpInterface, AttrSizedOperandSegments]> {
let summary = "Create dense tensor operation";		let summary = "Create dense tensor operation";
let description = [{		let description = [{
The `gpu.create_dn_tensor` operation initializes a dense tensor from		The `gpu.create_dn_tensor` operation initializes a dense tensor from
the given values buffer and sizes. The buffer must already be copied		the given values buffer and sizes. The buffer must already be copied
from the host to the device prior to using this operation. The		from the host to the device prior to using this operation. The
operation returns a handle to the dense tensor descriptor.		operation returns a handle to the dense tensor descriptor.

If the `async` keyword is present, the op is executed asynchronously (i.e.		If the `async` keyword is present, the op is executed asynchronously (i.e.
it does not block until the execution has finished on the device). In		it does not block until the execution has finished on the device). In
that case, it returns a !gpu.async.token in addition to the environment.		that case, it returns a !gpu.async.token in addition to the environment.

Example:		Example:

```mlir		```mlir
%dmat, %token = gpu.create_dn_tensor async [%dep] %env, %mem, %dims : index, index into memref<?xf64>		%dmat, %token = gpu.create_dn_tensor async [%dep] %mem, %dims : index, index into memref<?xf64>
```		```
}];		}];

let arguments = (ins Variadic<GPU_AsyncToken>:$asyncDependencies,		let arguments = (ins Variadic<GPU_AsyncToken>:$asyncDependencies,
GPU_SparseEnvHandle:$env,
AnyMemRef:$memref,		AnyMemRef:$memref,
Variadic<Index>:$dims);		Variadic<Index>:$dims);
let results = (outs Res<GPU_SparseDnTensorHandle>:$dnTensor, Optional<GPU_AsyncToken>:$asyncToken);		let results = (outs Res<GPU_SparseDnTensorHandle>:$dnTensor, Optional<GPU_AsyncToken>:$asyncToken);

let assemblyFormat = [{		let assemblyFormat = [{
custom<AsyncDependencies>(type($asyncToken), $asyncDependencies)		custom<AsyncDependencies>(type($asyncToken), $asyncDependencies)
$env `,` $memref `,` $dims attr-dict `:` type($dims) `into` type($memref)		$memref `,` $dims attr-dict `:` type($dims) `into` type($memref)
}];		}];
}		}

def GPU_DestroyDnTensorOp : GPU_Op<"destroy_dn_tensor", [GPU_AsyncOpInterface]> {		def GPU_DestroyDnTensorOp : GPU_Op<"destroy_dn_tensor", [GPU_AsyncOpInterface]> {
let summary = "Destroy dense tensor operation";		let summary = "Destroy dense tensor operation";
let description = [{		let description = [{
The `gpu.destroy_dn_tensor` operation releases all resources of a dense		The `gpu.destroy_dn_tensor` operation releases all resources of a dense
tensor represented by a handle that was previously created by a		tensor represented by a handle that was previously created by a
[147 lines not shown]	let description = [{

If the `async` keyword is present, the op is executed asynchronously (i.e.		If the `async` keyword is present, the op is executed asynchronously (i.e.
it does not block until the execution has finished on the device). In		it does not block until the execution has finished on the device). In
that case, it returns a !gpu.async.token in addition to the environment.		that case, it returns a !gpu.async.token in addition to the environment.

Example:		Example:

```mlir		```mlir
%spmat, %token = gpu.create_2to4_spmat async [%dep] %env, %rows, %cols, %mem : memref<?xf64>		%spmat, %token = gpu.create_2to4_spmat async [%dep] %rows, %cols, %mem : memref<?xf64>
```		```
}];		}];

let arguments = (ins Variadic<GPU_AsyncToken>:$asyncDependencies,		let arguments = (ins Variadic<GPU_AsyncToken>:$asyncDependencies,
GPU_SparseEnvHandle:$env,
Index:$rows,		Index:$rows,
Index:$cols,		Index:$cols,
AnyMemRef:$memref);		AnyMemRef:$memref);
let results = (outs Res<GPU_SparseSpMatHandle>:$spMat,		let results = (outs Res<GPU_SparseSpMatHandle>:$spMat,
Optional<GPU_AsyncToken>:$asyncToken);		Optional<GPU_AsyncToken>:$asyncToken);

let assemblyFormat = [{		let assemblyFormat = [{
custom<AsyncDependencies>(type($asyncToken), $asyncDependencies)		custom<AsyncDependencies>(type($asyncToken), $asyncDependencies)
$env `,` $rows `,` $cols `,` $memref attr-dict `:` type($memref)		$rows `,` $cols `,` $memref attr-dict `:` type($memref)
}];		}];
}		}

def GPU_DestroySpMatOp : GPU_Op<"destroy_sp_mat", [GPU_AsyncOpInterface]> {		def GPU_DestroySpMatOp : GPU_Op<"destroy_sp_mat", [GPU_AsyncOpInterface]> {
let summary = "Destroy sparse matrix operation";		let summary = "Destroy sparse matrix operation";
let description = [{		let description = [{
The `gpu.destroy_sp_mat` operation releases all resources of a sparse		The `gpu.destroy_sp_mat` operation releases all resources of a sparse
matrix represented by a handle that was previously created by a		matrix represented by a handle that was previously created by a
[58 lines not shown]	let description = [{

The matrix arguments can also be associated with one of the following		The matrix arguments can also be associated with one of the following
operators: NON_TRANSPOSE, TRANSPOSE, CONJUGATE_TRANSPOSE. The default value		operators: NON_TRANSPOSE, TRANSPOSE, CONJUGATE_TRANSPOSE. The default value
is NON_TRANSPOSE.		is NON_TRANSPOSE.

Example:		Example:

```mlir		```mlir
%buffersz, %token = gpu.spmv_buffer_size async [%dep] %env, %spmatA{TRANSPOSE}, %dnX, %dnY into f32		%buffersz, %token = gpu.spmv_buffer_size async [%dep] %spmatA{TRANSPOSE}, %dnX, %dnY into f32
```		```
}];		}];
let arguments = (ins Variadic<GPU_AsyncToken>:$asyncDependencies,		let arguments = (ins Variadic<GPU_AsyncToken>:$asyncDependencies,
GPU_SparseEnvHandle:$env,
GPU_TransposeModeAttr:$modeA,		GPU_TransposeModeAttr:$modeA,
GPU_SparseSpMatHandle:$spmatA,		GPU_SparseSpMatHandle:$spmatA,
GPU_SparseDnTensorHandle:$dnX,		GPU_SparseDnTensorHandle:$dnX,
GPU_SparseDnTensorHandle:$dnY,		GPU_SparseDnTensorHandle:$dnY,
TypeAttr:$computeType);		TypeAttr:$computeType);
let results = (outs Res<Index>:$bufferSz,		let results = (outs Res<Index>:$bufferSz,
Optional<GPU_AsyncToken>:$asyncToken);		Optional<GPU_AsyncToken>:$asyncToken);

let builders = [OpBuilder<(ins		let builders = [OpBuilder<(ins
"Type":$bufferSz,		"Type":$bufferSz,
"Type":$asyncToken,		"Type":$asyncToken,
"ValueRange":$asyncDependencies,		"ValueRange":$asyncDependencies,
"Value":$env,
"Value":$spmatA,		"Value":$spmatA,
"Value":$dnX,		"Value":$dnX,
"Value":$dnY,		"Value":$dnY,
"Type":$computeType)		"Type":$computeType)
, [{		, [{
auto modeA = gpu::TransposeMode::NON_TRANSPOSE;		auto modeA = gpu::TransposeMode::NON_TRANSPOSE;
return build($_builder, $_state, bufferSz, asyncToken, asyncDependencies,		return build($_builder, $_state, bufferSz, asyncToken, asyncDependencies,
env, modeA, spmatA, dnX, dnY, computeType);}]>		modeA, spmatA, dnX, dnY, computeType);}]>
];		];

let assemblyFormat = [{		let assemblyFormat = [{
custom<AsyncDependencies>(type($asyncToken), $asyncDependencies)		custom<AsyncDependencies>(type($asyncToken), $asyncDependencies)
$env `,` $spmatA (`{` $modeA^ `}`)? `,` $dnX `,` $dnY attr-dict `into` $computeType		$spmatA (`{` $modeA^ `}`)? `,` $dnX `,` $dnY attr-dict `into` $computeType
}];		}];
}		}

def GPU_SpMVOp : GPU_Op<"spmv", [GPU_AsyncOpInterface]> {		def GPU_SpMVOp : GPU_Op<"spmv", [GPU_AsyncOpInterface]> {
let summary = "SpMV operation";		let summary = "SpMV operation";
let description = [{		let description = [{
The `gpu.spmv` operation performs the SpMV operation on the given sparse matrix,		The `gpu.spmv` operation performs the SpMV operation on the given sparse matrix,
dense vectors, and buffer. The operation expects handles returned by previous		dense vectors, and buffer. The operation expects handles returned by previous
sparse operations to construct an environment and the operands for SpMV. The		sparse operations to construct an environment and the operands for SpMV. The
buffer must have been allocated on the device.		buffer must have been allocated on the device.

If the `async` keyword is present, the op is executed asynchronously (i.e.		If the `async` keyword is present, the op is executed asynchronously (i.e.
it does not block until the execution has finished on the device). In		it does not block until the execution has finished on the device). In
that case, it returns a !gpu.async.token in addition to the environment.		that case, it returns a !gpu.async.token in addition to the environment.

The matrix arguments can also be associated with one of the following		The matrix arguments can also be associated with one of the following
operators: NON_TRANSPOSE, TRANSPOSE, CONJUGATE_TRANSPOSE. The default value		operators: NON_TRANSPOSE, TRANSPOSE, CONJUGATE_TRANSPOSE. The default value
is NON_TRANSPOSE.		is NON_TRANSPOSE.

Example:		Example:

```mlir		```mlir
%token = gpu.spmv async [%dep] %env, %spmatA{TRANSPOSE}, %dnX, %dnY : memref<?xf64> into bf16		%token = gpu.spmv async [%dep] %spmatA{TRANSPOSE}, %dnX, %dnY : memref<?xf64> into bf16
```		```
}];		}];
let arguments = (ins Variadic<GPU_AsyncToken>:$asyncDependencies,		let arguments = (ins Variadic<GPU_AsyncToken>:$asyncDependencies,
GPU_SparseEnvHandle:$env,
GPU_TransposeModeAttr:$modeA,		GPU_TransposeModeAttr:$modeA,
GPU_SparseSpMatHandle:$spmatA,		GPU_SparseSpMatHandle:$spmatA,
GPU_SparseDnTensorHandle:$dnX,		GPU_SparseDnTensorHandle:$dnX,
GPU_SparseDnTensorHandle:$dnY,		GPU_SparseDnTensorHandle:$dnY,
TypeAttr:$computeType,		TypeAttr:$computeType,
AnyMemRef:$buffer);		AnyMemRef:$buffer);
let results = (outs Optional<GPU_AsyncToken>:$asyncToken);		let results = (outs Optional<GPU_AsyncToken>:$asyncToken);

let builders = [OpBuilder<(ins		let builders = [OpBuilder<(ins
"Type":$asyncToken,		"Type":$asyncToken,
"ValueRange":$asyncDependencies,		"ValueRange":$asyncDependencies,
"Value":$env,
"Value":$spmatA,		"Value":$spmatA,
"Value":$dnX,		"Value":$dnX,
"Value":$dnY,		"Value":$dnY,
"Type":$computeType,		"Type":$computeType,
"Value":$buffer), [{		"Value":$buffer), [{
auto modeA = gpu::TransposeMode::NON_TRANSPOSE;		auto modeA = gpu::TransposeMode::NON_TRANSPOSE;
return build($_builder, $_state, asyncToken, asyncDependencies, env, modeA,		return build($_builder, $_state, asyncToken, asyncDependencies, modeA,
spmatA, dnX, dnY, computeType, buffer);}]>		spmatA, dnX, dnY, computeType, buffer);}]>
];		];

let assemblyFormat = [{		let assemblyFormat = [{
custom<AsyncDependencies>(type($asyncToken), $asyncDependencies)		custom<AsyncDependencies>(type($asyncToken), $asyncDependencies)
$env `,` $spmatA (`{` $modeA^ `}`)? `,` $dnX `,` $dnY `,` $buffer attr-dict `:` type($buffer) `into` $computeType		$spmatA (`{` $modeA^ `}`)? `,` $dnX `,` $dnY `,` $buffer attr-dict `:` type($buffer) `into` $computeType
}];		}];
}		}

def GPU_SpMMBufferSizeOp : GPU_Op<"spmm_buffer_size", [GPU_AsyncOpInterface, AttrSizedResultSegments]> {		def GPU_SpMMBufferSizeOp : GPU_Op<"spmm_buffer_size", [GPU_AsyncOpInterface, AttrSizedResultSegments]> {
let summary = "Precompute buffersize for SpMM operation";		let summary = "Precompute buffersize for SpMM operation";
let description = [{		let description = [{
The `gpu.spmm_buffer_size` operation returns the buffer size required		The `gpu.spmm_buffer_size` operation returns the buffer size required
to perform the SpMM operation on the given sparse and dense matrix.		to perform the SpMM operation on the given sparse and dense matrix.
The operation expects handles returned by previous sparse operations		The operation expects handles returned by previous sparse operations
to construct an environment and the operands for SpMM.		to construct an environment and the operands for SpMM.

If the `async` keyword is present, the op is executed asynchronously (i.e.		If the `async` keyword is present, the op is executed asynchronously (i.e.
it does not block until the execution has finished on the device). In		it does not block until the execution has finished on the device). In
that case, it returns a !gpu.async.token in addition to the environment.		that case, it returns a !gpu.async.token in addition to the environment.

The matrix arguments can also be associated with one of the following		The matrix arguments can also be associated with one of the following
operators: NON_TRANSPOSE, TRANSPOSE, CONJUGATE_TRANSPOSE. The default value		operators: NON_TRANSPOSE, TRANSPOSE, CONJUGATE_TRANSPOSE. The default value
is NON_TRANSPOSE.		is NON_TRANSPOSE.

Example:		Example:

```mlir		```mlir
%bufferszs, %token = gpu.spmm_buffer_size async [%dep] %env, %spmatA{TRANSPOSE}, %dnmatB{TRANSPOSE}, %dnmatC : i64 into f32		%bufferszs, %token = gpu.spmm_buffer_size async [%dep] %spmatA{TRANSPOSE}, %dnmatB{TRANSPOSE}, %dnmatC : i64 into f32
```		```
}];		}];

let arguments = (ins Variadic<GPU_AsyncToken>:$asyncDependencies,		let arguments = (ins Variadic<GPU_AsyncToken>:$asyncDependencies,
GPU_SparseEnvHandle:$env,
GPU_TransposeModeAttr:$modeA,		GPU_TransposeModeAttr:$modeA,
GPU_TransposeModeAttr:$modeB,		GPU_TransposeModeAttr:$modeB,
GPU_SparseSpMatHandle:$spmatA,		GPU_SparseSpMatHandle:$spmatA,
GPU_SparseDnTensorHandle:$dnmatB,		GPU_SparseDnTensorHandle:$dnmatB,
GPU_SparseDnTensorHandle:$dnmatC,		GPU_SparseDnTensorHandle:$dnmatC,
TypeAttr:$computeType);		TypeAttr:$computeType);
let results = (outs Variadic<Index>:$bufferSzs,		let results = (outs Variadic<Index>:$bufferSzs,
Optional<GPU_AsyncToken>:$asyncToken);		Optional<GPU_AsyncToken>:$asyncToken);

let builders = [OpBuilder<(ins		let builders = [OpBuilder<(ins
"Type":$bufferSzs,		"Type":$bufferSzs,
"Type":$asyncToken,		"Type":$asyncToken,
"ValueRange":$asyncDependencies,		"ValueRange":$asyncDependencies,
"Value":$env,
"Value":$spmatA,		"Value":$spmatA,
"Value":$dnmatB,		"Value":$dnmatB,
"Value":$dnmatC,		"Value":$dnmatC,
"Type":$computeType), [{		"Type":$computeType), [{
auto modeA = gpu::TransposeMode::NON_TRANSPOSE;		auto modeA = gpu::TransposeMode::NON_TRANSPOSE;
auto modeB = gpu::TransposeMode::NON_TRANSPOSE;		auto modeB = gpu::TransposeMode::NON_TRANSPOSE;
return build($_builder, $_state, bufferSzs, asyncToken, asyncDependencies,		return build($_builder, $_state, bufferSzs, asyncToken, asyncDependencies,
env, modeA, modeB, spmatA, dnmatB, dnmatC, computeType);}]>		modeA, modeB, spmatA, dnmatB, dnmatC, computeType);}]>
];		];

let assemblyFormat = [{		let assemblyFormat = [{
custom<AsyncDependencies>(type($asyncToken), $asyncDependencies)		custom<AsyncDependencies>(type($asyncToken), $asyncDependencies)
$env `,` $spmatA (`{` $modeA^ `}`)? `,` $dnmatB (`{` $modeB^ `}`)? `,` $dnmatC attr-dict `:` type($bufferSzs) `into` $computeType		$spmatA (`{` $modeA^ `}`)? `,` $dnmatB (`{` $modeB^ `}`)? `,` $dnmatC attr-dict `:` type($bufferSzs) `into` $computeType
}];		}];
}		}

def GPU_SpMMOp : GPU_Op<"spmm", [GPU_AsyncOpInterface, AttrSizedOperandSegments]> {		def GPU_SpMMOp : GPU_Op<"spmm", [GPU_AsyncOpInterface, AttrSizedOperandSegments]> {
let summary = "SpMM operation";		let summary = "SpMM operation";
let description = [{		let description = [{
The `gpu.spmm` operation performs the SpMM operation on the given sparse and		The `gpu.spmm` operation performs the SpMM operation on the given sparse and
dense matrix, and buffer. The operation expects handles returned by previous		dense matrix, and buffer. The operation expects handles returned by previous
sparse operations to construct an environment and the operands for SpMM. The		sparse operations to construct an environment and the operands for SpMM. The
buffer must have been allocated on the device.		buffer must have been allocated on the device.

If the `async` keyword is present, the op is executed asynchronously (i.e.		If the `async` keyword is present, the op is executed asynchronously (i.e.
it does not block until the execution has finished on the device). In		it does not block until the execution has finished on the device). In
that case, it returns a !gpu.async.token in addition to the environment.		that case, it returns a !gpu.async.token in addition to the environment.

The matrix arguments can also be associated with one of the following		The matrix arguments can also be associated with one of the following
operators: NON_TRANSPOSE, TRANSPOSE, CONJUGATE_TRANSPOSE. The default value		operators: NON_TRANSPOSE, TRANSPOSE, CONJUGATE_TRANSPOSE. The default value
is NON_TRANSPOSE.		is NON_TRANSPOSE.

Example:		Example:

```mlir		```mlir
%token = gpu.spmm async [%dep] %env, %spmatA{TRANSPOSE}, %dnmatB{TRANSPOSE}, %dnmatC, %buffers : type($buffers) into f32		%token = gpu.spmm async [%dep] %spmatA{TRANSPOSE}, %dnmatB{TRANSPOSE}, %dnmatC, %buffers : type($buffers) into f32
```		```
}];		}];

let arguments = (ins Variadic<GPU_AsyncToken>:$asyncDependencies,		let arguments = (ins Variadic<GPU_AsyncToken>:$asyncDependencies,
GPU_SparseEnvHandle:$env,
GPU_TransposeModeAttr:$modeA,		GPU_TransposeModeAttr:$modeA,
GPU_TransposeModeAttr:$modeB,		GPU_TransposeModeAttr:$modeB,
GPU_SparseSpMatHandle:$spmatA,		GPU_SparseSpMatHandle:$spmatA,
GPU_SparseDnTensorHandle:$dnmatB,		GPU_SparseDnTensorHandle:$dnmatB,
GPU_SparseDnTensorHandle:$dnmatC,		GPU_SparseDnTensorHandle:$dnmatC,
TypeAttr:$computeType,		TypeAttr:$computeType,
Variadic<AnyMemRef>:$buffers);		Variadic<AnyMemRef>:$buffers);
let results = (outs Optional<GPU_AsyncToken>:$asyncToken);		let results = (outs Optional<GPU_AsyncToken>:$asyncToken);

let builders = [OpBuilder<(ins		let builders = [OpBuilder<(ins
"Type":$asyncToken,		"Type":$asyncToken,
"ValueRange":$asyncDependencies,		"ValueRange":$asyncDependencies,
"Value":$env,
"Value":$spmatA,		"Value":$spmatA,
"Value":$dnmatB,		"Value":$dnmatB,
"Value":$dnmatC,		"Value":$dnmatC,
"Type":$computeType,		"Type":$computeType,
"ValueRange":$buffers), [{		"ValueRange":$buffers), [{
auto modeA = gpu::TransposeMode::NON_TRANSPOSE;		auto modeA = gpu::TransposeMode::NON_TRANSPOSE;
auto modeB = gpu::TransposeMode::NON_TRANSPOSE;		auto modeB = gpu::TransposeMode::NON_TRANSPOSE;
return build($_builder, $_state, asyncToken, asyncDependencies, env, modeA,		return build($_builder, $_state, asyncToken, asyncDependencies, modeA,
modeB, spmatA, dnmatB, dnmatC, computeType, buffers);}]>		modeB, spmatA, dnmatB, dnmatC, computeType, buffers);}]>
];		];

let assemblyFormat = [{		let assemblyFormat = [{
custom<AsyncDependencies>(type($asyncToken), $asyncDependencies)		custom<AsyncDependencies>(type($asyncToken), $asyncDependencies)
$env `,` $spmatA (`{` $modeA^ `}`)? `,` $dnmatB (`{` $modeB^ `}`)? `,` $dnmatC `,` $buffers attr-dict `:` type($buffers) `into` $computeType		$spmatA (`{` $modeA^ `}`)? `,` $dnmatB (`{` $modeB^ `}`)? `,` $dnmatC `,` $buffers attr-dict `:` type($buffers) `into` $computeType
}];		}];
}		}

def GPU_SDDMMBufferSizeOp : GPU_Op<"sddmm_buffer_size", [GPU_AsyncOpInterface]> {		def GPU_SDDMMBufferSizeOp : GPU_Op<"sddmm_buffer_size", [GPU_AsyncOpInterface]> {
let summary = "Precompute buffersize for SDDMM operation";		let summary = "Precompute buffersize for SDDMM operation";
let description = [{		let description = [{
The `gpu.sddmm_buffer_size` operation returns the buffer size required		The `gpu.sddmm_buffer_size` operation returns the buffer size required
to perform the SDDMM operation on the given sparse and dense matrices.		to perform the SDDMM operation on the given sparse and dense matrices.
The operation expects handles returned by previous sparse operations		The operation expects handles returned by previous sparse operations
to construct an environment and the operands for SDDMM.		to construct an environment and the operands for SDDMM.

If the `async` keyword is present, the op is executed asynchronously (i.e.		If the `async` keyword is present, the op is executed asynchronously (i.e.
it does not block until the execution has finished on the device). In		it does not block until the execution has finished on the device). In
that case, it returns a !gpu.async.token in addition to the environment.		that case, it returns a !gpu.async.token in addition to the environment.

Example:		Example:

```mlir		```mlir
%buffersz, %token = gpu.sddmm_buffer_size async [%dep] %env, %dnmatA{TRANSPOSE}, %dnmatB{TRANSPOSE}, %spmatC into f32		%buffersz, %token = gpu.sddmm_buffer_size async [%dep] %dnmatA{TRANSPOSE}, %dnmatB{TRANSPOSE}, %spmatC into f32
```		```

The matrix arguments can also be associated with one of the following		The matrix arguments can also be associated with one of the following
operators: NON_TRANSPOSE, TRANSPOSE, CONJUGATE_TRANSPOSE. The default value		operators: NON_TRANSPOSE, TRANSPOSE, CONJUGATE_TRANSPOSE. The default value
is NON_TRANSPOSE.		is NON_TRANSPOSE.
}];		}];

let arguments = (ins Variadic<GPU_AsyncToken>:$asyncDependencies,		let arguments = (ins Variadic<GPU_AsyncToken>:$asyncDependencies,
GPU_SparseEnvHandle:$env,
GPU_TransposeModeAttr:$modeA,		GPU_TransposeModeAttr:$modeA,
GPU_TransposeModeAttr:$modeB,		GPU_TransposeModeAttr:$modeB,
GPU_SparseDnTensorHandle:$dnmatA,		GPU_SparseDnTensorHandle:$dnmatA,
GPU_SparseDnTensorHandle:$dnmatB,		GPU_SparseDnTensorHandle:$dnmatB,
GPU_SparseSpMatHandle:$spmatC,		GPU_SparseSpMatHandle:$spmatC,
TypeAttr:$computeType);		TypeAttr:$computeType);
let results = (outs Res<Index>:$bufferSz, Optional<GPU_AsyncToken>:$asyncToken);		let results = (outs Res<Index>:$bufferSz, Optional<GPU_AsyncToken>:$asyncToken);

let builders = [OpBuilder<(ins		let builders = [OpBuilder<(ins
"Type":$bufferSz,		"Type":$bufferSz,
"Type":$asyncToken,		"Type":$asyncToken,
"ValueRange":$asyncDependencies,		"ValueRange":$asyncDependencies,
"Value":$env,
"Value":$dnmatA,		"Value":$dnmatA,
"Value":$dnmatB,		"Value":$dnmatB,
"Value":$spmatC,		"Value":$spmatC,
"Type":$computeType), [{		"Type":$computeType), [{
auto modeA = gpu::TransposeMode::NON_TRANSPOSE;		auto modeA = gpu::TransposeMode::NON_TRANSPOSE;
auto modeB = gpu::TransposeMode::NON_TRANSPOSE;		auto modeB = gpu::TransposeMode::NON_TRANSPOSE;
return build($_builder, $_state, bufferSz, asyncToken, asyncDependencies,		return build($_builder, $_state, bufferSz, asyncToken, asyncDependencies,
env, modeA, modeB, dnmatA, dnmatB, spmatC, computeType);}]>		modeA, modeB, dnmatA, dnmatB, spmatC, computeType);}]>
];		];

let assemblyFormat = [{		let assemblyFormat = [{
custom<AsyncDependencies>(type($asyncToken), $asyncDependencies)		custom<AsyncDependencies>(type($asyncToken), $asyncDependencies)
$env `,` $dnmatA (`{` $modeA^ `}`)? `,` $dnmatB (`{` $modeB^ `}`)? `,` $spmatC attr-dict `into` $computeType		$dnmatA (`{` $modeA^ `}`)? `,` $dnmatB (`{` $modeB^ `}`)? `,` $spmatC attr-dict `into` $computeType
}];		}];
}		}

def GPU_SDDMMOp : GPU_Op<"sddmm", [GPU_AsyncOpInterface]> {		def GPU_SDDMMOp : GPU_Op<"sddmm", [GPU_AsyncOpInterface]> {
let summary = "SDDMM operation";		let summary = "SDDMM operation";
let description = [{		let description = [{
The `gpu.sddmm` operation performs the SDDMM operation on the given sparse and		The `gpu.sddmm` operation performs the SDDMM operation on the given sparse and
dense matrices, and buffer. The operation expects handles returned by previous		dense matrices, and buffer. The operation expects handles returned by previous
sparse operations to construct an environment and the operands for SDDMM. The		sparse operations to construct an environment and the operands for SDDMM. The
buffer must have been allocated on the device.		buffer must have been allocated on the device.

If the `async` keyword is present, the op is executed asynchronously (i.e.		If the `async` keyword is present, the op is executed asynchronously (i.e.
it does not block until the execution has finished on the device). In		it does not block until the execution has finished on the device). In
that case, it returns a !gpu.async.token in addition to the environment.		that case, it returns a !gpu.async.token in addition to the environment.

Example:		Example:

```mlir		```mlir
%token = gpu.sddmm async [%dep] %env, %dnmatA{TRANSPOSE}, %dnmatB{TRANSPOSE}, %spmatC, %buffer into f32		%token = gpu.sddmm async [%dep] %dnmatA{TRANSPOSE}, %dnmatB{TRANSPOSE}, %spmatC, %buffer into f32
```		```

The matrix arguments can also be associated with one of the following		The matrix arguments can also be associated with one of the following
operators: NON_TRANSPOSE, TRANSPOSE, CONJUGATE_TRANSPOSE. The default value		operators: NON_TRANSPOSE, TRANSPOSE, CONJUGATE_TRANSPOSE. The default value
is NON_TRANSPOSE.		is NON_TRANSPOSE.
}];		}];

let arguments = (ins Variadic<GPU_AsyncToken>:$asyncDependencies,		let arguments = (ins Variadic<GPU_AsyncToken>:$asyncDependencies,
GPU_SparseEnvHandle:$env,
GPU_TransposeModeAttr:$modeA,		GPU_TransposeModeAttr:$modeA,
GPU_TransposeModeAttr:$modeB,		GPU_TransposeModeAttr:$modeB,
GPU_SparseDnTensorHandle:$dnmatA,		GPU_SparseDnTensorHandle:$dnmatA,
GPU_SparseDnTensorHandle:$dnmatB,		GPU_SparseDnTensorHandle:$dnmatB,
GPU_SparseSpMatHandle:$spmatC,		GPU_SparseSpMatHandle:$spmatC,
TypeAttr:$computeType,		TypeAttr:$computeType,
AnyMemRef:$buffer);		AnyMemRef:$buffer);
let results = (outs Optional<GPU_AsyncToken>:$asyncToken);		let results = (outs Optional<GPU_AsyncToken>:$asyncToken);

let builders = [OpBuilder<(ins		let builders = [OpBuilder<(ins
"Type":$asyncToken,		"Type":$asyncToken,
"ValueRange":$asyncDependencies,		"ValueRange":$asyncDependencies,
"Value":$env,
"Value":$dnmatA,		"Value":$dnmatA,
"Value":$dnmatB,		"Value":$dnmatB,
"Value":$spmatC,		"Value":$spmatC,
"Type":$computeType,		"Type":$computeType,
"Value":$buffer), [{		"Value":$buffer), [{
auto modeA = gpu::TransposeMode::NON_TRANSPOSE;		auto modeA = gpu::TransposeMode::NON_TRANSPOSE;
auto modeB = gpu::TransposeMode::NON_TRANSPOSE;		auto modeB = gpu::TransposeMode::NON_TRANSPOSE;
return build($_builder, $_state, asyncToken, asyncDependencies, env, modeA,		return build($_builder, $_state, asyncToken, asyncDependencies, modeA,
modeB, dnmatA, dnmatB, spmatC, computeType, buffer);}]>		modeB, dnmatA, dnmatB, spmatC, computeType, buffer);}]>
];		];

let assemblyFormat = [{		let assemblyFormat = [{
custom<AsyncDependencies>(type($asyncToken), $asyncDependencies)		custom<AsyncDependencies>(type($asyncToken), $asyncDependencies)
$env `,` $dnmatA (`{` $modeA^ `}`)? `,` $dnmatB (`{` $modeB^ `}`)? `,` $spmatC `,` $buffer attr-dict `:` type($buffer) `into` $computeType		$dnmatA (`{` $modeA^ `}`)? `,` $dnmatB (`{` $modeB^ `}`)? `,` $spmatC `,` $buffer attr-dict `:` type($buffer) `into` $computeType
}];		}];
}		}

#endif // GPU_OPS		#endif // GPU_OPS

mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp

...	FunctionCallBuilder memset32CallBuilder = {
llvmIntPtrType /* intptr_t sizeBytes */,		llvmIntPtrType /* intptr_t sizeBytes */,
llvmPointerType /* void *stream */}};		llvmPointerType /* void *stream */}};
FunctionCallBuilder setDefaultDeviceCallBuilder = {		FunctionCallBuilder setDefaultDeviceCallBuilder = {
"mgpuSetDefaultDevice",		"mgpuSetDefaultDevice",
llvmVoidType,		llvmVoidType,
{llvmInt32Type /* uint32_t devIndex */}};		{llvmInt32Type /* uint32_t devIndex */}};
FunctionCallBuilder createSparseEnvCallBuilder = {		FunctionCallBuilder createSparseEnvCallBuilder = {
"mgpuCreateSparseEnv",		"mgpuCreateSparseEnv",
llvmPointerType,		llvmVoidType,
{llvmPointerType /* void *stream */}};		{llvmPointerType /* void *stream */}};
FunctionCallBuilder destroySparseEnvCallBuilder = {		FunctionCallBuilder destroySparseEnvCallBuilder = {
"mgpuDestroySparseEnv",		"mgpuDestroySparseEnv",
llvmVoidType,		llvmVoidType,
{llvmPointerType, llvmPointerType /* void *stream */}};		{llvmPointerType /* void *stream */}};
FunctionCallBuilder createDnVecCallBuilder = {		FunctionCallBuilder createDnVecCallBuilder = {
"mgpuCreateDnVec",		"mgpuCreateDnVec",
llvmPointerType,		llvmPointerType,
{llvmIntPtrType, llvmPointerType, llvmInt32Type,		{llvmIntPtrType, llvmPointerType, llvmInt32Type,
llvmPointerType /* void *stream */}};		llvmPointerType /* void *stream */}};
FunctionCallBuilder destroyDnVecCallBuilder = {		FunctionCallBuilder destroyDnVecCallBuilder = {
"mgpuDestroyDnVec",		"mgpuDestroyDnVec",
llvmVoidType,		llvmVoidType,
...	FunctionCallBuilder createCsrCallBuilder = {
llvmInt32Type, llvmPointerType /* void *stream */}};		llvmInt32Type, llvmPointerType /* void *stream */}};
FunctionCallBuilder destroySpMatCallBuilder = {		FunctionCallBuilder destroySpMatCallBuilder = {
"mgpuDestroySpMat",		"mgpuDestroySpMat",
llvmVoidType,		llvmVoidType,
{llvmPointerType, llvmPointerType /* void *stream */}};		{llvmPointerType, llvmPointerType /* void *stream */}};
FunctionCallBuilder spMVBufferSizeCallBuilder = {		FunctionCallBuilder spMVBufferSizeCallBuilder = {
"mgpuSpMVBufferSize",		"mgpuSpMVBufferSize",
llvmIntPtrType,		llvmIntPtrType,
{llvmPointerType, llvmInt32Type, llvmPointerType, llvmPointerType,		{llvmInt32Type, llvmPointerType, llvmPointerType, llvmPointerType,
llvmPointerType, llvmInt32Type, llvmPointerType /* void *stream */}};		llvmInt32Type, llvmPointerType /* void *stream */}};
FunctionCallBuilder spMVCallBuilder = {		FunctionCallBuilder spMVCallBuilder = {
"mgpuSpMV",		"mgpuSpMV",
llvmVoidType,		llvmVoidType,
{llvmPointerType, llvmInt32Type, llvmPointerType, llvmPointerType,		{llvmInt32Type, llvmPointerType, llvmPointerType, llvmPointerType,
llvmPointerType, llvmInt32Type, llvmPointerType,		llvmInt32Type, llvmPointerType, llvmPointerType /* void *stream */}};
llvmPointerType /* void *stream */}};
FunctionCallBuilder createSpMMBufferSizeCallBuilder = {		FunctionCallBuilder createSpMMBufferSizeCallBuilder = {
"mgpuSpMMBufferSize",		"mgpuSpMMBufferSize",
llvmIntPtrType,		llvmIntPtrType,
{llvmPointerType, llvmInt32Type, llvmInt32Type, llvmPointerType,		{llvmInt32Type, llvmInt32Type, llvmPointerType, llvmPointerType,
llvmPointerType, llvmPointerType, llvmInt32Type,		llvmPointerType, llvmInt32Type, llvmPointerType /* void *stream */}};
llvmPointerType /* void *stream */}};
FunctionCallBuilder createSpMMCallBuilder = {		FunctionCallBuilder createSpMMCallBuilder = {
"mgpuSpMM",		"mgpuSpMM",
llvmVoidType,		llvmVoidType,
{llvmPointerType, llvmInt32Type, llvmInt32Type, llvmPointerType,		{llvmInt32Type, llvmInt32Type, llvmPointerType, llvmPointerType,
llvmPointerType, llvmPointerType, llvmInt32Type, llvmPointerType,		llvmPointerType, llvmInt32Type, llvmPointerType,
llvmPointerType /* void *stream */}};		llvmPointerType /* void *stream */}};
FunctionCallBuilder createSDDMMBufferSizeCallBuilder = {		FunctionCallBuilder createSDDMMBufferSizeCallBuilder = {
"mgpuSDDMMBufferSize",		"mgpuSDDMMBufferSize",
llvmIntPtrType,		llvmIntPtrType,
{llvmPointerType, llvmInt32Type, llvmInt32Type, llvmPointerType,		{llvmInt32Type, llvmInt32Type, llvmPointerType, llvmPointerType,
llvmPointerType, llvmPointerType, llvmInt32Type,		llvmPointerType, llvmInt32Type, llvmPointerType /* void *stream */}};
llvmPointerType /* void *stream */}};
FunctionCallBuilder createSDDMMCallBuilder = {		FunctionCallBuilder createSDDMMCallBuilder = {
"mgpuSDDMM",		"mgpuSDDMM",
llvmVoidType,		llvmVoidType,
{llvmPointerType, llvmInt32Type, llvmInt32Type, llvmPointerType,		{llvmInt32Type, llvmInt32Type, llvmPointerType, llvmPointerType,
llvmPointerType, llvmPointerType, llvmInt32Type, llvmPointerType,		llvmPointerType, llvmInt32Type, llvmPointerType,
llvmPointerType /* void *stream */}};		llvmPointerType /* void *stream */}};
FunctionCallBuilder createSparseLtEnvCallBuilder = {		FunctionCallBuilder createSparseLtEnvCallBuilder = {
"mgpuCreateSparseLtEnv",		"mgpuCreateSparseLtEnv",
llvmVoidType,		llvmVoidType,
{llvmPointerType, llvmPointerType /* void *stream */}};		{llvmPointerType /* void *stream */}};
FunctionCallBuilder destroySparseLtEnvCallBuilder = {		FunctionCallBuilder destroySparseLtEnvCallBuilder = {
"mgpuDestroySparseLtEnv",		"mgpuDestroySparseLtEnv",
llvmVoidType,		llvmVoidType,
{llvmPointerType, llvmPointerType /* void *stream */}};		{llvmPointerType /* void *stream */}};
FunctionCallBuilder createLtDnMatCallBuilder = {		FunctionCallBuilder createLtDnMatCallBuilder = {
"mgpuCreateCuSparseLtDnMat",		"mgpuCreateCuSparseLtDnMat",
llvmVoidType,		llvmVoidType,
{llvmPointerType, llvmPointerType, llvmIntPtrType, llvmIntPtrType,		{llvmPointerType, llvmIntPtrType, llvmIntPtrType, llvmPointerType,
llvmPointerType, llvmInt32Type, llvmPointerType /* void *stream */}};		llvmInt32Type, llvmPointerType /* void *stream */}};
FunctionCallBuilder destroyCuSparseLtSpMatBuilder = {		FunctionCallBuilder destroyCuSparseLtSpMatBuilder = {
"mgpuDestroyCuSparseLtSpMat",		"mgpuDestroyCuSparseLtSpMat",
llvmVoidType,		llvmVoidType,
{llvmPointerType, llvmPointerType /* void *stream */}};		{llvmPointerType, llvmPointerType /* void *stream */}};
FunctionCallBuilder destroyCuSparseLtDnMatBuilder = {		FunctionCallBuilder destroyCuSparseLtDnMatBuilder = {
"mgpuDestroyCuSparseLtDnMat",		"mgpuDestroyCuSparseLtDnMat",
llvmVoidType,		llvmVoidType,
{llvmPointerType, llvmPointerType /* void *stream */}};		{llvmPointerType, llvmPointerType /* void *stream */}};
FunctionCallBuilder create2To4SpMatCallBuilder = {		FunctionCallBuilder create2To4SpMatCallBuilder = {
"mgpuCusparseLtCreate2To4SpMat",		"mgpuCusparseLtCreate2To4SpMat",
llvmVoidType,		llvmVoidType,
{llvmPointerType, llvmPointerType, llvmIntPtrType, llvmIntPtrType,		{llvmPointerType, llvmIntPtrType, llvmIntPtrType, llvmPointerType,
llvmPointerType, llvmInt32Type, llvmPointerType /* void *stream */}};		llvmInt32Type, llvmPointerType /* void *stream */}};
FunctionCallBuilder createCuSparseLtSpMMBufferSizeBuilder = {		FunctionCallBuilder createCuSparseLtSpMMBufferSizeBuilder = {
"mgpuCuSparseLtSpMMBufferSize",		"mgpuCuSparseLtSpMMBufferSize",
llvmVoidType,		llvmVoidType,
{llvmPointerType, llvmPointerType, llvmInt32Type, llvmInt32Type,		{llvmPointerType, llvmInt32Type, llvmInt32Type, llvmPointerType,
llvmPointerType, llvmPointerType, llvmPointerType, llvmInt32Type,		llvmPointerType, llvmPointerType, llvmInt32Type,
llvmPointerType /*void *stream*/}};		llvmPointerType /*void *stream*/}};
FunctionCallBuilder createCuSparseLtSpMMBuilder = {		FunctionCallBuilder createCuSparseLtSpMMBuilder = {
"mgpuCuSparseLtSpMM",		"mgpuCuSparseLtSpMM",
llvmVoidType,		llvmVoidType,
{llvmPointerType, llvmPointerType, llvmPointerType, llvmPointerType,		{llvmPointerType, llvmPointerType, llvmPointerType, llvmPointerType,
llvmPointerType, llvmPointerType, llvmPointerType,		llvmPointerType, llvmPointerType, llvmPointerType /*void *stream*/}};
llvmPointerType /*void *stream*/}};
};		};

/// A rewrite pattern to convert gpu.host_register operations into a GPU runtime		/// A rewrite pattern to convert gpu.host_register operations into a GPU runtime
/// call. Currently it supports CUDA and ROCm (HIP).		/// call. Currently it supports CUDA and ROCm (HIP).
class ConvertHostRegisterOpToGpuRuntimeCallPattern		class ConvertHostRegisterOpToGpuRuntimeCallPattern
: public ConvertOpToGpuRuntimeCallPattern<gpu::HostRegisterOp> {		: public ConvertOpToGpuRuntimeCallPattern<gpu::HostRegisterOp> {
public:		public:
ConvertHostRegisterOpToGpuRuntimeCallPattern(LLVMTypeConverter &typeConverter)		ConvertHostRegisterOpToGpuRuntimeCallPattern(LLVMTypeConverter &typeConverter)
...	LogicalResult ConvertCreateSparseEnvOpToGpuRuntimeCallPattern::matchAndRewrite(
ConversionPatternRewriter &rewriter) const {		ConversionPatternRewriter &rewriter) const {
if (failed(areAllLLVMTypes(op, adaptor.getOperands(), rewriter)) ||		if (failed(areAllLLVMTypes(op, adaptor.getOperands(), rewriter)) ||
failed(isAsyncWithOneDependency(rewriter, op)))		failed(isAsyncWithOneDependency(rewriter, op)))
return failure();		return failure();
Location loc = op.getLoc();		Location loc = op.getLoc();
auto stream = adaptor.getAsyncDependencies().front();		auto stream = adaptor.getAsyncDependencies().front();
// Use the cusparseLt create call if the dnmat is used with spmat with		// Use the cusparseLt create call if the dnmat is used with spmat with
// 2:4 sparsity		// 2:4 sparsity
Value handle;		if (op.getRtLibMode() == gpu::RtLibMode::CUSPARSE_AND_CUSPARSE_LT) {
if (isSpMMCusparseLtOp(op.getEnv())) {
// CUDA runner asserts the size is 11024 bytes.		// CUDA runner asserts the size is 11024 bytes.
auto handleSz = rewriter.create<LLVM::ConstantOp>(		createSparseLtEnvCallBuilder.create(loc, rewriter, {stream}).getResult();
loc, getIndexType(), rewriter.getIndexAttr(11024));		}
handle = rewriter.create<LLVM::AllocaOp>(loc, llvmInt8PointerType,		if (op.getRtLibMode() == gpu::RtLibMode::CUSPARSE_AND_CUSPARSE_LT ||
llvmInt8Type, handleSz);		op.getRtLibMode() == gpu::RtLibMode::CUSPARSE) {
handle = rewriter.create<LLVM::BitcastOp>(loc, llvmPointerType, handle);
createSparseLtEnvCallBuilder.create(loc, rewriter, {handle, stream})
.getResult();
} else {
handle =
createSparseEnvCallBuilder.create(loc, rewriter, {stream}).getResult();		createSparseEnvCallBuilder.create(loc, rewriter, {stream}).getResult();
}		}
rewriter.replaceOp(op, {handle, stream});		rewriter.replaceOp(op, {stream});
return success();		return success();
}		}

LogicalResult ConvertDestroySparseEnvOpToGpuRuntimeCallPattern::matchAndRewrite(		LogicalResult ConvertDestroySparseEnvOpToGpuRuntimeCallPattern::matchAndRewrite(
gpu::DestroySparseEnvOp op, OpAdaptor adaptor,		gpu::DestroySparseEnvOp op, OpAdaptor adaptor,
ConversionPatternRewriter &rewriter) const {		ConversionPatternRewriter &rewriter) const {
if (failed(areAllLLVMTypes(op, adaptor.getOperands(), rewriter)) ||		if (failed(areAllLLVMTypes(op, adaptor.getOperands(), rewriter)) ||
failed(isAsyncWithOneDependency(rewriter, op)))		failed(isAsyncWithOneDependency(rewriter, op)))
return failure();		return failure();
Location loc = op.getLoc();		Location loc = op.getLoc();
auto stream = adaptor.getAsyncDependencies().front();		auto stream = adaptor.getAsyncDependencies().front();
// Use the cusparseLt destroy call if the dnmat is used with spmat with		// Use the cusparseLt destroy call if the dnmat is used with spmat with
// 2:4 sparsity		// 2:4 sparsity
if (isSpMMCusparseLtOp(op.getEnv())) {		if (op.getRtLibMode() == gpu::RtLibMode::CUSPARSE_AND_CUSPARSE_LT) {
destroySparseLtEnvCallBuilder.create(loc, rewriter,		destroySparseLtEnvCallBuilder.create(loc, rewriter, {stream});
{adaptor.getEnv(), stream});		}
} else {		if (op.getRtLibMode() == gpu::RtLibMode::CUSPARSE_AND_CUSPARSE_LT ||
destroySparseEnvCallBuilder.create(loc, rewriter,		op.getRtLibMode() == gpu::RtLibMode::CUSPARSE) {
{adaptor.getEnv(), stream});		destroySparseEnvCallBuilder.create(loc, rewriter, {stream});
}		}
rewriter.replaceOp(op, {stream});		rewriter.replaceOp(op, {stream});
return success();		return success();
}		}

LogicalResult ConvertCreateDnTensorOpToGpuRuntimeCallPattern::matchAndRewrite(		LogicalResult ConvertCreateDnTensorOpToGpuRuntimeCallPattern::matchAndRewrite(
gpu::CreateDnTensorOp op, OpAdaptor adaptor,		gpu::CreateDnTensorOp op, OpAdaptor adaptor,
ConversionPatternRewriter &rewriter) const {		ConversionPatternRewriter &rewriter) const {
...	LogicalResult ConvertCreateDnTensorOpToGpuRuntimeCallPattern::matchAndRewrite(
// TODO: For now, we track the use of the handle and lower it to cusparse /		// TODO: For now, we track the use of the handle and lower it to cusparse /
// cusparseLt accordingly. If in a block, both cusparse and cusparseLt are		// cusparseLt accordingly. If in a block, both cusparse and cusparseLt are
// used, we require two separate Creation ops to be the correct logic. In		// used, we require two separate Creation ops to be the correct logic. In
// future, we may add support to using one handle in sparse tensor / GPU		// future, we may add support to using one handle in sparse tensor / GPU
// dialect in both cusparse and cusparseLt. use the cusparseLt create call if		// dialect in both cusparse and cusparseLt. use the cusparseLt create call if
// the dnmat is used with spmat with 2:4 sparsity		// the dnmat is used with spmat with 2:4 sparsity
if (dims.size() == 2) {		if (dims.size() == 2) {
if (isSpMMCusparseLtOp(op.getDnTensor())) {		if (isSpMMCusparseLtOp(op.getDnTensor())) {
auto envHandle = adaptor.getEnv();
auto handleSz = rewriter.create<LLVM::ConstantOp>(		auto handleSz = rewriter.create<LLVM::ConstantOp>(
loc, getIndexType(), rewriter.getIndexAttr(11032));		loc, getIndexType(), rewriter.getIndexAttr(11032));
handle = rewriter.create<LLVM::AllocaOp>(loc, llvmInt8PointerType,		handle = rewriter.create<LLVM::AllocaOp>(loc, llvmInt8PointerType,
llvmInt8Type, handleSz);		llvmInt8Type, handleSz);
handle = rewriter.create<LLVM::BitcastOp>(loc, llvmPointerType, handle);		handle = rewriter.create<LLVM::BitcastOp>(loc, llvmPointerType, handle);

createLtDnMatCallBuilder		createLtDnMatCallBuilder
.create(loc, rewriter,		.create(loc, rewriter,
{handle, envHandle, dims[0], dims[1], pTensor, dtp, stream})		{handle, dims[0], dims[1], pTensor, dtp, stream})
.getResult();		.getResult();
} else {		} else {
handle =		handle =
createDnMatCallBuilder		createDnMatCallBuilder
.create(loc, rewriter, {dims[0], dims[1], pTensor, dtp, stream})		.create(loc, rewriter, {dims[0], dims[1], pTensor, dtp, stream})
.getResult();		.getResult();
}		}
} else {		} else {
...	LogicalResult ConvertCreate2To4SpMatOpToGpuRuntimeCallPattern::matchAndRewrite(
auto stream = adaptor.getAsyncDependencies().front();		auto stream = adaptor.getAsyncDependencies().front();
Value pMat =		Value pMat =
MemRefDescriptor(adaptor.getMemref()).allocatedPtr(rewriter, loc);		MemRefDescriptor(adaptor.getMemref()).allocatedPtr(rewriter, loc);
if (!getTypeConverter()->useOpaquePointers())		if (!getTypeConverter()->useOpaquePointers())
pMat = rewriter.create<LLVM::BitcastOp>(loc, llvmPointerType, pMat);		pMat = rewriter.create<LLVM::BitcastOp>(loc, llvmPointerType, pMat);
Type dType =		Type dType =
llvm::cast<MemRefType>(op.getMemref().getType()).getElementType();		llvm::cast<MemRefType>(op.getMemref().getType()).getElementType();
auto dtp = genConstInt32From(rewriter, loc, getCuSparseDataTypeFrom(dType));		auto dtp = genConstInt32From(rewriter, loc, getCuSparseDataTypeFrom(dType));
auto envHandle = adaptor.getEnv();

// CUDA runner asserts the size is 44104 bytes.		// CUDA runner asserts the size is 44104 bytes.
auto handleSz = rewriter.create<LLVM::ConstantOp>(		auto handleSz = rewriter.create<LLVM::ConstantOp>(
loc, getIndexType(), rewriter.getIndexAttr(44104));		loc, getIndexType(), rewriter.getIndexAttr(44104));
Value handle = rewriter.create<LLVM::AllocaOp>(loc, llvmInt8PointerType,		Value handle = rewriter.create<LLVM::AllocaOp>(loc, llvmInt8PointerType,
llvmInt8Type, handleSz);		llvmInt8Type, handleSz);
handle = rewriter.create<LLVM::BitcastOp>(loc, llvmPointerType, handle);		handle = rewriter.create<LLVM::BitcastOp>(loc, llvmPointerType, handle);

create2To4SpMatCallBuilder		create2To4SpMatCallBuilder
.create(loc, rewriter,		.create(loc, rewriter,
{handle, envHandle, adaptor.getRows(), adaptor.getCols(), pMat,		{handle, adaptor.getRows(), adaptor.getCols(), pMat, dtp, stream})
dtp, stream})
.getResult();		.getResult();
rewriter.replaceOp(op, {handle, stream});		rewriter.replaceOp(op, {handle, stream});
return success();		return success();
}		}

LogicalResult ConvertDestroySpMatOpToGpuRuntimeCallPattern::matchAndRewrite(		LogicalResult ConvertDestroySpMatOpToGpuRuntimeCallPattern::matchAndRewrite(
gpu::DestroySpMatOp op, OpAdaptor adaptor,		gpu::DestroySpMatOp op, OpAdaptor adaptor,
ConversionPatternRewriter &rewriter) const {		ConversionPatternRewriter &rewriter) const {
...	LogicalResult ConvertSpMVBufferSizeOpToGpuRuntimeCallPattern::matchAndRewrite(
if (failed(areAllLLVMTypes(op, adaptor.getOperands(), rewriter)) ||		if (failed(areAllLLVMTypes(op, adaptor.getOperands(), rewriter)) ||
failed(isAsyncWithOneDependency(rewriter, op)))		failed(isAsyncWithOneDependency(rewriter, op)))
return failure();		return failure();
Location loc = op.getLoc();		Location loc = op.getLoc();
auto modeA = genConstInt32From(rewriter, loc, op.getModeA());		auto modeA = genConstInt32From(rewriter, loc, op.getModeA());
auto computeType = genConstInt32From(		auto computeType = genConstInt32From(
rewriter, loc, getCuSparseDataTypeFrom(adaptor.getComputeType()));		rewriter, loc, getCuSparseDataTypeFrom(adaptor.getComputeType()));
auto stream = adaptor.getAsyncDependencies().front();		auto stream = adaptor.getAsyncDependencies().front();
auto bufferSize =		auto bufferSize = spMVBufferSizeCallBuilder
spMVBufferSizeCallBuilder
.create(loc, rewriter,		.create(loc, rewriter,
{adaptor.getEnv(), modeA, adaptor.getSpmatA(),		{modeA, adaptor.getSpmatA(), adaptor.getDnX(),
adaptor.getDnX(), adaptor.getDnY(), computeType, stream})		adaptor.getDnY(), computeType, stream})
.getResult();		.getResult();
rewriter.replaceOp(op, {bufferSize, stream});		rewriter.replaceOp(op, {bufferSize, stream});
return success();		return success();
}		}

LogicalResult ConvertSpMVOpToGpuRuntimeCallPattern::matchAndRewrite(		LogicalResult ConvertSpMVOpToGpuRuntimeCallPattern::matchAndRewrite(
gpu::SpMVOp op, OpAdaptor adaptor,		gpu::SpMVOp op, OpAdaptor adaptor,
ConversionPatternRewriter &rewriter) const {		ConversionPatternRewriter &rewriter) const {
if (failed(areAllLLVMTypes(op, adaptor.getOperands(), rewriter)) ||		if (failed(areAllLLVMTypes(op, adaptor.getOperands(), rewriter)) ||
failed(isAsyncWithOneDependency(rewriter, op)))		failed(isAsyncWithOneDependency(rewriter, op)))
return failure();		return failure();
Location loc = op.getLoc();		Location loc = op.getLoc();
auto modeA = genConstInt32From(rewriter, loc, adaptor.getModeA());		auto modeA = genConstInt32From(rewriter, loc, adaptor.getModeA());
auto computeType = genConstInt32From(		auto computeType = genConstInt32From(
rewriter, loc, getCuSparseDataTypeFrom(adaptor.getComputeType()));		rewriter, loc, getCuSparseDataTypeFrom(adaptor.getComputeType()));
auto stream = adaptor.getAsyncDependencies().front();		auto stream = adaptor.getAsyncDependencies().front();
Value pBuf =		Value pBuf =
MemRefDescriptor(adaptor.getBuffer()).allocatedPtr(rewriter, loc);		MemRefDescriptor(adaptor.getBuffer()).allocatedPtr(rewriter, loc);
if (!getTypeConverter()->useOpaquePointers())		if (!getTypeConverter()->useOpaquePointers())
pBuf = rewriter.create<LLVM::BitcastOp>(loc, llvmPointerType, pBuf);		pBuf = rewriter.create<LLVM::BitcastOp>(loc, llvmPointerType, pBuf);
spMVCallBuilder.create(loc, rewriter,		spMVCallBuilder.create(loc, rewriter,
{adaptor.getEnv(), modeA, adaptor.getSpmatA(),		{modeA, adaptor.getSpmatA(), adaptor.getDnX(),
adaptor.getDnX(), adaptor.getDnY(), computeType, pBuf,		adaptor.getDnY(), computeType, pBuf, stream});
stream});
rewriter.replaceOp(op, {stream});		rewriter.replaceOp(op, {stream});
return success();		return success();
}		}

LogicalResult ConvertSpMMBufferSizeOpToGpuRuntimeCallPattern::matchAndRewrite(		LogicalResult ConvertSpMMBufferSizeOpToGpuRuntimeCallPattern::matchAndRewrite(
gpu::SpMMBufferSizeOp op, OpAdaptor adaptor,		gpu::SpMMBufferSizeOp op, OpAdaptor adaptor,
ConversionPatternRewriter &rewriter) const {		ConversionPatternRewriter &rewriter) const {
if (failed(areAllLLVMTypes(op, adaptor.getOperands(), rewriter)) ||		if (failed(areAllLLVMTypes(op, adaptor.getOperands(), rewriter)) ||
failed(isAsyncWithOneDependency(rewriter, op)))		failed(isAsyncWithOneDependency(rewriter, op)))
return failure();		return failure();
Location loc = op.getLoc();		Location loc = op.getLoc();
auto modeA = genConstInt32From(rewriter, loc, adaptor.getModeA());		auto modeA = genConstInt32From(rewriter, loc, adaptor.getModeA());
auto modeB = genConstInt32From(rewriter, loc, adaptor.getModeB());		auto modeB = genConstInt32From(rewriter, loc, adaptor.getModeB());
auto stream = adaptor.getAsyncDependencies().front();		auto stream = adaptor.getAsyncDependencies().front();
Value bufferSize;		Value bufferSize;
if (is2To4Sparsity(op.getSpmatA())) {		if (is2To4Sparsity(op.getSpmatA())) {
auto computeType = genConstInt32From(		auto computeType = genConstInt32From(
rewriter, loc, getCuSparseLtDataTypeFrom(adaptor.getComputeType()));		rewriter, loc, getCuSparseLtDataTypeFrom(adaptor.getComputeType()));
auto three = rewriter.create<LLVM::ConstantOp>(loc, getIndexType(),		auto three = rewriter.create<LLVM::ConstantOp>(loc, getIndexType(),
rewriter.getIndexAttr(3));		rewriter.getIndexAttr(3));
auto bufferSize = rewriter.create<LLVM::AllocaOp>(loc, llvmInt64PointerType,		auto bufferSize = rewriter.create<LLVM::AllocaOp>(loc, llvmInt64PointerType,
llvmInt64Type, three);		llvmInt64Type, three);
createCuSparseLtSpMMBufferSizeBuilder		createCuSparseLtSpMMBufferSizeBuilder
.create(loc, rewriter,		.create(loc, rewriter,
{bufferSize, adaptor.getEnv(), modeA, modeB,		{bufferSize, modeA, modeB, adaptor.getSpmatA(),
adaptor.getSpmatA(), adaptor.getDnmatB(), adaptor.getDnmatC(),		adaptor.getDnmatB(), adaptor.getDnmatC(), computeType, stream})
computeType, stream})
.getResult();		.getResult();

auto bufferSizePtr1 = rewriter.create<LLVM::GEPOp>(		auto bufferSizePtr1 = rewriter.create<LLVM::GEPOp>(
loc, llvmInt64PointerType, llvmInt64PointerType, bufferSize,		loc, llvmInt64PointerType, llvmInt64PointerType, bufferSize,
ValueRange{rewriter.create<LLVM::ConstantOp>(		ValueRange{rewriter.create<LLVM::ConstantOp>(
loc, getIndexType(), rewriter.getIndexAttr(1))});		loc, getIndexType(), rewriter.getIndexAttr(1))});
auto bufferSizePtr2 = rewriter.create<LLVM::GEPOp>(		auto bufferSizePtr2 = rewriter.create<LLVM::GEPOp>(
loc, llvmInt64PointerType, llvmInt64PointerType, bufferSize,		loc, llvmInt64PointerType, llvmInt64PointerType, bufferSize,
ValueRange{rewriter.create<LLVM::ConstantOp>(		ValueRange{rewriter.create<LLVM::ConstantOp>(
loc, getIndexType(), rewriter.getIndexAttr(2))});		loc, getIndexType(), rewriter.getIndexAttr(2))});
auto bufferSize0 =		auto bufferSize0 =
rewriter.create<LLVM::LoadOp>(loc, llvmInt64Type, bufferSize);		rewriter.create<LLVM::LoadOp>(loc, llvmInt64Type, bufferSize);
auto bufferSize1 =		auto bufferSize1 =
rewriter.create<LLVM::LoadOp>(loc, llvmInt64Type, bufferSizePtr1);		rewriter.create<LLVM::LoadOp>(loc, llvmInt64Type, bufferSizePtr1);
auto bufferSize2 =		auto bufferSize2 =
rewriter.create<LLVM::LoadOp>(loc, llvmInt64Type, bufferSizePtr2);		rewriter.create<LLVM::LoadOp>(loc, llvmInt64Type, bufferSizePtr2);

rewriter.replaceOp(op, {bufferSize0, bufferSize1, bufferSize2, stream});		rewriter.replaceOp(op, {bufferSize0, bufferSize1, bufferSize2, stream});
} else {		} else {
auto computeType = genConstInt32From(		auto computeType = genConstInt32From(
rewriter, loc, getCuSparseDataTypeFrom(adaptor.getComputeType()));		rewriter, loc, getCuSparseDataTypeFrom(adaptor.getComputeType()));
bufferSize = createSpMMBufferSizeCallBuilder		bufferSize =
		createSpMMBufferSizeCallBuilder
.create(loc, rewriter,		.create(loc, rewriter,
{adaptor.getEnv(), modeA, modeB,		{modeA, modeB, adaptor.getSpmatA(), adaptor.getDnmatB(),
adaptor.getSpmatA(), adaptor.getDnmatB(),
adaptor.getDnmatC(), computeType, stream})		adaptor.getDnmatC(), computeType, stream})
.getResult();		.getResult();
rewriter.replaceOp(op, {bufferSize, stream});		rewriter.replaceOp(op, {bufferSize, stream});
}		}
return success();		return success();
}		}

LogicalResult ConvertSDDMMBufferSizeOpToGpuRuntimeCallPattern::matchAndRewrite(		LogicalResult ConvertSDDMMBufferSizeOpToGpuRuntimeCallPattern::matchAndRewrite(
gpu::SDDMMBufferSizeOp op, OpAdaptor adaptor,		gpu::SDDMMBufferSizeOp op, OpAdaptor adaptor,
ConversionPatternRewriter &rewriter) const {		ConversionPatternRewriter &rewriter) const {
if (failed(areAllLLVMTypes(op, adaptor.getOperands(), rewriter)) ||		if (failed(areAllLLVMTypes(op, adaptor.getOperands(), rewriter)) ||
failed(isAsyncWithOneDependency(rewriter, op)))		failed(isAsyncWithOneDependency(rewriter, op)))
return failure();		return failure();
Location loc = op.getLoc();		Location loc = op.getLoc();
auto modeA = genConstInt32From(rewriter, loc, adaptor.getModeA());		auto modeA = genConstInt32From(rewriter, loc, adaptor.getModeA());
auto modeB = genConstInt32From(rewriter, loc, adaptor.getModeB());		auto modeB = genConstInt32From(rewriter, loc, adaptor.getModeB());
auto computeType = genConstInt32From(		auto computeType = genConstInt32From(
rewriter, loc, getCuSparseDataTypeFrom(adaptor.getComputeType()));		rewriter, loc, getCuSparseDataTypeFrom(adaptor.getComputeType()));
auto stream = adaptor.getAsyncDependencies().front();		auto stream = adaptor.getAsyncDependencies().front();
auto bufferSize = createSDDMMBufferSizeCallBuilder		auto bufferSize =
		createSDDMMBufferSizeCallBuilder
.create(loc, rewriter,		.create(loc, rewriter,
{adaptor.getEnv(), modeA, modeB,		{modeA, modeB, adaptor.getDnmatA(), adaptor.getDnmatB(),
adaptor.getDnmatA(), adaptor.getDnmatB(),
adaptor.getSpmatC(), computeType, stream})		adaptor.getSpmatC(), computeType, stream})
.getResult();		.getResult();
rewriter.replaceOp(op, {bufferSize, stream});		rewriter.replaceOp(op, {bufferSize, stream});
return success();		return success();
}		}

LogicalResult ConvertSpMMOpToGpuRuntimeCallPattern::matchAndRewrite(		LogicalResult ConvertSpMMOpToGpuRuntimeCallPattern::matchAndRewrite(
gpu::SpMMOp op, OpAdaptor adaptor,		gpu::SpMMOp op, OpAdaptor adaptor,
ConversionPatternRewriter &rewriter) const {		ConversionPatternRewriter &rewriter) const {
if (failed(areAllLLVMTypes(op, adaptor.getOperands(), rewriter)) ||		if (failed(areAllLLVMTypes(op, adaptor.getOperands(), rewriter)) ||
Show All 13 Lines	if (is2To4Sparsity(op.getSpmatA())) {
for (Value buffer : adaptor.getBuffers()) {		for (Value buffer : adaptor.getBuffers()) {
Value pBuf = MemRefDescriptor(buffer).allocatedPtr(rewriter, loc);		Value pBuf = MemRefDescriptor(buffer).allocatedPtr(rewriter, loc);
if (!getTypeConverter()->useOpaquePointers())		if (!getTypeConverter()->useOpaquePointers())
pBuf = rewriter.create<LLVM::BitcastOp>(loc, llvmPointerType, pBuf);		pBuf = rewriter.create<LLVM::BitcastOp>(loc, llvmPointerType, pBuf);
pBufs.push_back(pBuf);		pBufs.push_back(pBuf);
}		}
createCuSparseLtSpMMBuilder.create(		createCuSparseLtSpMMBuilder.create(
loc, rewriter,		loc, rewriter,
{adaptor.getEnv(), adaptor.getSpmatA(), adaptor.getDnmatB(),		{adaptor.getSpmatA(), adaptor.getDnmatB(), adaptor.getDnmatC(),
adaptor.getDnmatC(), pBufs[0], pBufs[1], pBufs[2], stream});		pBufs[0], pBufs[1], pBufs[2], stream});
} else {		} else {
Value pBuf = MemRefDescriptor(adaptor.getBuffers().front())		Value pBuf = MemRefDescriptor(adaptor.getBuffers().front())
.allocatedPtr(rewriter, loc);		.allocatedPtr(rewriter, loc);
if (!getTypeConverter()->useOpaquePointers())		if (!getTypeConverter()->useOpaquePointers())
pBuf = rewriter.create<LLVM::BitcastOp>(loc, llvmPointerType, pBuf);		pBuf = rewriter.create<LLVM::BitcastOp>(loc, llvmPointerType, pBuf);
createSpMMCallBuilder.create(		createSpMMCallBuilder.create(loc, rewriter,
loc, rewriter,		{modeA, modeB, adaptor.getSpmatA(),
{adaptor.getEnv(), modeA, modeB, adaptor.getSpmatA(),		adaptor.getDnmatB(), adaptor.getDnmatC(),
adaptor.getDnmatB(), adaptor.getDnmatC(), computeType, pBuf, stream});		computeType, pBuf, stream});
}		}
rewriter.replaceOp(op, {stream});		rewriter.replaceOp(op, {stream});
return success();		return success();
}		}

template <typename T>		template <typename T>
static void addOpaquePointerConversion(LLVMTypeConverter &converter) {		static void addOpaquePointerConversion(LLVMTypeConverter &converter) {
converter.addConversion([&converter](T) -> Type {		converter.addConversion([&converter](T) -> Type {
Show All 13 Lines	auto computeType = genConstInt32From(
rewriter, loc, getCuSparseDataTypeFrom(adaptor.getComputeType()));		rewriter, loc, getCuSparseDataTypeFrom(adaptor.getComputeType()));
auto modeA = genConstInt32From(rewriter, loc, adaptor.getModeA());		auto modeA = genConstInt32From(rewriter, loc, adaptor.getModeA());
auto modeB = genConstInt32From(rewriter, loc, adaptor.getModeB());		auto modeB = genConstInt32From(rewriter, loc, adaptor.getModeB());
auto stream = adaptor.getAsyncDependencies().front();		auto stream = adaptor.getAsyncDependencies().front();
Value pBuf =		Value pBuf =
MemRefDescriptor(adaptor.getBuffer()).allocatedPtr(rewriter, loc);		MemRefDescriptor(adaptor.getBuffer()).allocatedPtr(rewriter, loc);
if (!getTypeConverter()->useOpaquePointers())		if (!getTypeConverter()->useOpaquePointers())
pBuf = rewriter.create<LLVM::BitcastOp>(loc, llvmPointerType, pBuf);		pBuf = rewriter.create<LLVM::BitcastOp>(loc, llvmPointerType, pBuf);
createSDDMMCallBuilder.create(		createSDDMMCallBuilder.create(loc, rewriter,
loc, rewriter,		{modeA, modeB, adaptor.getDnmatA(),
{adaptor.getEnv(), modeA, modeB, adaptor.getDnmatA(), adaptor.getDnmatB(),		adaptor.getDnmatB(), adaptor.getSpmatC(),
adaptor.getSpmatC(), computeType, pBuf, stream});		computeType, pBuf, stream});
rewriter.replaceOp(op, {stream});		rewriter.replaceOp(op, {stream});
return success();		return success();
}		}

void mlir::populateGpuToLLVMConversionPatterns(LLVMTypeConverter &converter,		void mlir::populateGpuToLLVMConversionPatterns(LLVMTypeConverter &converter,
RewritePatternSet &patterns,		RewritePatternSet &patterns,
StringRef gpuBinaryAnnotation,		StringRef gpuBinaryAnnotation,
bool kernelBarePtrCallConv) {		bool kernelBarePtrCallConv) {
Show All 34 Lines

mlir/lib/Dialect/SparseTensor/Transforms/SparseGPUCodegen.cpp

Show All 31 Lines
using namespace mlir::sparse_tensor;		using namespace mlir::sparse_tensor;

namespace {		namespace {

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// Helper methods.		// Helper methods.
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

/// Marks the given top module as a GPU container module.		/// Marks the given top module as a GPU container module.
		aartbik (inline comment, marked Done): add empty line back
		aartbik (inline comment, marked Not Done): still there?
static void markAsGPUContainer(ModuleOp topModule) {		static void markAsGPUContainer(ModuleOp topModule) {
topModule->setAttr(gpu::GPUDialect::getContainerModuleAttrName(),		topModule->setAttr(gpu::GPUDialect::getContainerModuleAttrName(),
UnitAttr::get(topModule->getContext()));		UnitAttr::get(topModule->getContext()));
}		}

/// Constructs a new GPU module (for GPU kernels) inside the given top module,		/// Constructs a new GPU module (for GPU kernels) inside the given top module,
/// or returns an existing GPU module if one was built previously.		/// or returns an existing GPU module if one was built previously.
static gpu::GPUModuleOp genGPUModule(OpBuilder &builder, ModuleOp topModule) {		static gpu::GPUModuleOp genGPUModule(OpBuilder &builder, ModuleOp topModule) {
for (auto op : topModule.getBodyRegion().getOps<gpu::GPUModuleOp>())		for (auto op : topModule.getBodyRegion().getOps<gpu::GPUModuleOp>())
return op; // existing		return op; // existing
markAsGPUContainer(topModule);		markAsGPUContainer(topModule);
builder.setInsertionPointToStart(&topModule.getBodyRegion().front());		builder.setInsertionPointToStart(&topModule.getBodyRegion().front());
return builder.create<gpu::GPUModuleOp>(topModule->getLoc(),		return builder.create<gpu::GPUModuleOp>(topModule->getLoc(),
"sparse_kernels");		"sparse_kernels");
}		}

/// Constructs a new GPU kernel in the given GPU module.		/// Constructs a new GPU kernel in the given GPU module.
static gpu::GPUFuncOp genGPUFunc(OpBuilder &builder, gpu::GPUModuleOp gpuModule,		static gpu::GPUFuncOp genGPUFunc(OpBuilder &builder, gpu::GPUModuleOp gpuModule,
SmallVectorImpl<Value> &args) {		SmallVectorImpl<Value> &args) {
// Get a unique kernel name. Not very creative,		// Get a unique kernel name. Not very creative,
// but we simply try kernel0, kernel1, etc.		// but we simply try kernel0, kernel1, etc.
unsigned kernelNumber = 0;		unsigned kernelNumber = 0;
SmallString<16> kernelName;		SmallString<16> kernelName;
		aartbik (inline comment, marked Done): I believe codegen already provides createFuncCall(rewriter, loc, "foo", {}, {}, EmitCInterface::Off); for this? But it is not needed per our convention.
do {		do {
kernelName.clear();		kernelName.clear();
("kernel" + Twine(kernelNumber++)).toStringRef(kernelName);		("kernel" + Twine(kernelNumber++)).toStringRef(kernelName);
} while (gpuModule.lookupSymbol(kernelName));		} while (gpuModule.lookupSymbol(kernelName));
// Then we insert a new kernel with given arguments into the module.		// Then we insert a new kernel with given arguments into the module.
builder.setInsertionPointToStart(&gpuModule.getBodyRegion().front());		builder.setInsertionPointToStart(&gpuModule.getBodyRegion().front());
SmallVector<Type> argsTp;		SmallVector<Type> argsTp;
for (unsigned i = 0, e = args.size(); i < e; i++)		for (unsigned i = 0, e = args.size(); i < e; i++)
▲ Show 20 Lines • Show All 381 Lines • ▼ Show 20 Lines	#endif
assert(colA);		assert(colA);
return builder.create<gpu::CreateCsrOp>(loc, handleTp, tokenTp, token, sz1,		return builder.create<gpu::CreateCsrOp>(loc, handleTp, tokenTp, token, sz1,
sz2, nseA, rowA, colA, valA);		sz2, nseA, rowA, colA, valA);
}		}

/// Match and rewrite SpMV kernel.		/// Match and rewrite SpMV kernel.
static LogicalResult rewriteSpMV(PatternRewriter &rewriter,		static LogicalResult rewriteSpMV(PatternRewriter &rewriter,
linalg::GenericOp op, bool enableRT) {		linalg::GenericOp op, bool enableRT) {
Location loc = op.getLoc();		Location loc = op.getLoc();
		aartbik (inline comment, marked Done): Remove all this; we assume the client will do this.
Value a = op.getOperand(0);		Value a = op.getOperand(0);
Value x = op.getOperand(1);		Value x = op.getOperand(1);
Value y = op.getOperand(2); // we have y = Ax		Value y = op.getOperand(2); // we have y = Ax
SmallVector<Value> tokens;		SmallVector<Value> tokens;

// Only admissible sparse matrix format and dense vectors.		// Only admissible sparse matrix format and dense vectors.
bool isCOO = false;		bool isCOO = false;
SparseTensorType aTp = getSparseTensorType(a);		SparseTensorType aTp = getSparseTensorType(a);
Show All 19 Lines	static LogicalResult rewriteSpMV(PatternRewriter &rewriter,
Value vecX = genAllocCopy(rewriter, loc, memX, tokens);		Value vecX = genAllocCopy(rewriter, loc, memX, tokens);
Value memY = genTensorToMemref(rewriter, loc, y);		Value memY = genTensorToMemref(rewriter, loc, y);
Value vecY = genAllocCopy(rewriter, loc, memY, tokens);		Value vecY = genAllocCopy(rewriter, loc, memY, tokens);
genBlockingWait(rewriter, loc, tokens);		genBlockingWait(rewriter, loc, tokens);
tokens.clear();		tokens.clear();

// Create sparse environment and sparse matrix/dense vector handles.		// Create sparse environment and sparse matrix/dense vector handles.
Type indexTp = rewriter.getIndexType();		Type indexTp = rewriter.getIndexType();
Type envHandleTp = rewriter.getType<gpu::SparseEnvHandleType>();
Type dnTensorHandleTp = rewriter.getType<gpu::SparseDnTensorHandleType>();		Type dnTensorHandleTp = rewriter.getType<gpu::SparseDnTensorHandleType>();
Type spmatHandleTp = rewriter.getType<gpu::SparseSpMatHandleType>();		Type spmatHandleTp = rewriter.getType<gpu::SparseSpMatHandleType>();
Type tokenTp = rewriter.getType<gpu::AsyncTokenType>();		Type tokenTp = rewriter.getType<gpu::AsyncTokenType>();
Value token = genFirstWait(rewriter, loc);		Value token = genFirstWait(rewriter, loc);
auto env =		auto env = rewriter.create<gpu::CreateSparseEnvOp>(loc, tokenTp, token,
rewriter.create<gpu::CreateSparseEnvOp>(loc, envHandleTp, tokenTp, token);		gpu::RtLibMode::CUSPARSE);
Value handle = env.getResult(0);
token = env.getAsyncToken();		token = env.getAsyncToken();
		K-Wu (author, inline comment, marked Done): I also noted a TODO here.
Operation *spGenA =		Operation *spGenA =
genSpMat(rewriter, loc, spmatHandleTp, tokenTp, token, szY, szX, nseA,		genSpMat(rewriter, loc, spmatHandleTp, tokenTp, token, szY, szX, nseA,
rowA, colA, valA, isCOO, enableRT);		rowA, colA, valA, isCOO, enableRT);
Value spMatA = spGenA->getResult(0);		Value spMatA = spGenA->getResult(0);
token = spGenA->getResult(1);		token = spGenA->getResult(1);
auto dvecX = rewriter.create<gpu::CreateDnTensorOp>(		auto dvecX = rewriter.create<gpu::CreateDnTensorOp>(
loc, dnTensorHandleTp, tokenTp, token, handle, vecX, szX);		loc, dnTensorHandleTp, tokenTp, token, vecX, szX);
Value dnX = dvecX.getResult(0);		Value dnX = dvecX.getResult(0);
token = dvecX.getAsyncToken();		token = dvecX.getAsyncToken();
auto dvecY = rewriter.create<gpu::CreateDnTensorOp>(		auto dvecY = rewriter.create<gpu::CreateDnTensorOp>(
loc, dnTensorHandleTp, tokenTp, token, handle, vecY, szY);		loc, dnTensorHandleTp, tokenTp, token, vecY, szY);
Value dnY = dvecY.getResult(0);		Value dnY = dvecY.getResult(0);
token = dvecY.getAsyncToken();		token = dvecY.getAsyncToken();

auto dnYType = llvm::cast<ShapedType>(y.getType()).getElementType();		auto dnYType = llvm::cast<ShapedType>(y.getType()).getElementType();

// Precompute buffersize for SpMV.		// Precompute buffersize for SpMV.
auto bufferComp = rewriter.create<gpu::SpMVBufferSizeOp>(		auto bufferComp = rewriter.create<gpu::SpMVBufferSizeOp>(
loc, indexTp, tokenTp, token, handle, spMatA, dnX, dnY,		loc, indexTp, tokenTp, token, spMatA, dnX, dnY,
/*computeType=*/dnYType);		/*computeType=*/dnYType);
Value bufferSz = bufferComp.getResult(0);		Value bufferSz = bufferComp.getResult(0);
token = bufferComp.getAsyncToken();		token = bufferComp.getAsyncToken();
auto buf = genAllocBuffer(rewriter, loc, bufferSz, token);		auto buf = genAllocBuffer(rewriter, loc, bufferSz, token);
Value buffer = buf.getResult(0);		Value buffer = buf.getResult(0);
token = buf.getAsyncToken();		token = buf.getAsyncToken();

// Perform the SpMV.		// Perform the SpMV.
auto spmvComp =		auto spmvComp = rewriter.create<gpu::SpMVOp>(
rewriter.create<gpu::SpMVOp>(loc, tokenTp, token, handle, spMatA, dnX,		loc, tokenTp, token, spMatA, dnX, dnY, /*computeType=*/dnYType, buffer);
dnY, /*computeType=*/dnYType, buffer);
token = spmvComp.getAsyncToken();		token = spmvComp.getAsyncToken();

// Copy data back to host and free all the resources.		// Copy data back to host and free all the resources.
token = rewriter.create<gpu::DestroySpMatOp>(loc, tokenTp, token, spMatA)		token = rewriter.create<gpu::DestroySpMatOp>(loc, tokenTp, token, spMatA)
.getAsyncToken();		.getAsyncToken();
token = rewriter.create<gpu::DestroyDnTensorOp>(loc, tokenTp, token, dnX)		token = rewriter.create<gpu::DestroyDnTensorOp>(loc, tokenTp, token, dnX)
.getAsyncToken();		.getAsyncToken();
token = rewriter.create<gpu::DestroyDnTensorOp>(loc, tokenTp, token, dnY)		token = rewriter.create<gpu::DestroyDnTensorOp>(loc, tokenTp, token, dnY)
.getAsyncToken();		.getAsyncToken();
token = rewriter.create<gpu::DestroySparseEnvOp>(loc, tokenTp, token, handle)		token = rewriter
		.create<gpu::DestroySparseEnvOp>(loc, tokenTp, token,
		gpu::RtLibMode::CUSPARSE)
.getAsyncToken();		.getAsyncToken();
token = genDeallocMemRef(rewriter, loc, rowA, token);		token = genDeallocMemRef(rewriter, loc, rowA, token);
if (colA)		if (colA)
token = genDeallocMemRef(rewriter, loc, colA, token);		token = genDeallocMemRef(rewriter, loc, colA, token);
token = genDeallocMemRef(rewriter, loc, valA, token);		token = genDeallocMemRef(rewriter, loc, valA, token);
token = genDeallocMemRef(rewriter, loc, buffer, token);		token = genDeallocMemRef(rewriter, loc, buffer, token);
token = genDeallocMemRef(rewriter, loc, vecX, token);		token = genDeallocMemRef(rewriter, loc, vecX, token);
token = genCopyMemRef(rewriter, loc, memY, vecY, token);		token = genCopyMemRef(rewriter, loc, memY, vecY, token);
▲ Show 20 Lines • Show All 42 Lines • ▼ Show 20 Lines	static LogicalResult rewriteSpMM(PatternRewriter &rewriter,
Value matB = genAllocCopy(rewriter, loc, bufB, tokens);		Value matB = genAllocCopy(rewriter, loc, bufB, tokens);
Value bufC = genTensorToMemref(rewriter, loc, c);		Value bufC = genTensorToMemref(rewriter, loc, c);
Value matC = genAllocCopy(rewriter, loc, bufC, tokens);		Value matC = genAllocCopy(rewriter, loc, bufC, tokens);
genBlockingWait(rewriter, loc, tokens);		genBlockingWait(rewriter, loc, tokens);
tokens.clear();		tokens.clear();

// Create sparse environment and sparse matrix/dense matrix handles.		// Create sparse environment and sparse matrix/dense matrix handles.
Type indexTp = rewriter.getIndexType();		Type indexTp = rewriter.getIndexType();
Type envHandleTp = rewriter.getType<gpu::SparseEnvHandleType>();
Type dnTensorHandleTp = rewriter.getType<gpu::SparseDnTensorHandleType>();		Type dnTensorHandleTp = rewriter.getType<gpu::SparseDnTensorHandleType>();
Type spMatHandleTp = rewriter.getType<gpu::SparseSpMatHandleType>();		Type spMatHandleTp = rewriter.getType<gpu::SparseSpMatHandleType>();
Type tokenTp = rewriter.getType<gpu::AsyncTokenType>();		Type tokenTp = rewriter.getType<gpu::AsyncTokenType>();
Value token = genFirstWait(rewriter, loc);		Value token = genFirstWait(rewriter, loc);
auto env =		auto env = rewriter.create<gpu::CreateSparseEnvOp>(loc, tokenTp, token,
rewriter.create<gpu::CreateSparseEnvOp>(loc, envHandleTp, tokenTp, token);		gpu::RtLibMode::CUSPARSE);
Value handle = env.getResult(0);
token = env.getAsyncToken();		token = env.getAsyncToken();
Operation *spGenA =		Operation *spGenA =
genSpMat(rewriter, loc, spMatHandleTp, tokenTp, token, szm, szk, nseA,		genSpMat(rewriter, loc, spMatHandleTp, tokenTp, token, szm, szk, nseA,
		aartbik (inline comment, marked Done): yeah, none of the env stuff in this file
rowA, colA, valA, isCOO, enableRT);		rowA, colA, valA, isCOO, enableRT);
Value spMatA = spGenA->getResult(0);		Value spMatA = spGenA->getResult(0);
token = spGenA->getResult(1);		token = spGenA->getResult(1);
auto dmatB = rewriter.create<gpu::CreateDnTensorOp>(		auto dmatB = rewriter.create<gpu::CreateDnTensorOp>(
loc, dnTensorHandleTp, tokenTp, token, handle, matB,		loc, dnTensorHandleTp, tokenTp, token, matB,
SmallVector<Value>{szk, szn});		SmallVector<Value>{szk, szn});
Value dnB = dmatB.getResult(0);		Value dnB = dmatB.getResult(0);
token = dmatB.getAsyncToken();		token = dmatB.getAsyncToken();
auto dmatC = rewriter.create<gpu::CreateDnTensorOp>(		auto dmatC = rewriter.create<gpu::CreateDnTensorOp>(
loc, dnTensorHandleTp, tokenTp, token, handle, matC,		loc, dnTensorHandleTp, tokenTp, token, matC,
SmallVector<Value>{szm, szn});		SmallVector<Value>{szm, szn});
Value dnC = dmatC.getResult(0);		Value dnC = dmatC.getResult(0);
token = dmatC.getAsyncToken();		token = dmatC.getAsyncToken();

auto dmatCType = llvm::cast<ShapedType>(c.getType()).getElementType();		auto dmatCType = llvm::cast<ShapedType>(c.getType()).getElementType();

// Precompute buffersize for SpMM.		// Precompute buffersize for SpMM.
auto bufferComp = rewriter.create<gpu::SpMMBufferSizeOp>(		auto bufferComp = rewriter.create<gpu::SpMMBufferSizeOp>(
loc, indexTp, tokenTp, token, handle, spMatA, dnB, dnC,		loc, indexTp, tokenTp, token, spMatA, dnB, dnC,
/*computeType=*/dmatCType);		/*computeType=*/dmatCType);
Value bufferSz = bufferComp.getResult(0);		Value bufferSz = bufferComp.getResult(0);
token = bufferComp.getAsyncToken();		token = bufferComp.getAsyncToken();
auto buf = genAllocBuffer(rewriter, loc, bufferSz, token);		auto buf = genAllocBuffer(rewriter, loc, bufferSz, token);
Value buffer = buf.getResult(0);		Value buffer = buf.getResult(0);
token = buf.getAsyncToken();		token = buf.getAsyncToken();

auto dnCType = llvm::cast<ShapedType>(c.getType()).getElementType();		auto dnCType = llvm::cast<ShapedType>(c.getType()).getElementType();

// Perform the SpMM.		// Perform the SpMM.
auto spmmComp =		auto spmmComp = rewriter.create<gpu::SpMMOp>(
rewriter.create<gpu::SpMMOp>(loc, tokenTp, token, handle, spMatA, dnB,		loc, tokenTp, token, spMatA, dnB, dnC, /*computeType=*/dnCType, buffer);
dnC, /*computeType=*/dnCType, buffer);
token = spmmComp.getAsyncToken();		token = spmmComp.getAsyncToken();

// Copy data back to host and free all the resources.		// Copy data back to host and free all the resources.
token = rewriter.create<gpu::DestroySpMatOp>(loc, tokenTp, token, spMatA)		token = rewriter.create<gpu::DestroySpMatOp>(loc, tokenTp, token, spMatA)
.getAsyncToken();		.getAsyncToken();
token = rewriter.create<gpu::DestroyDnTensorOp>(loc, tokenTp, token, dnB)		token = rewriter.create<gpu::DestroyDnTensorOp>(loc, tokenTp, token, dnB)
.getAsyncToken();		.getAsyncToken();
token = rewriter.create<gpu::DestroyDnTensorOp>(loc, tokenTp, token, dnC)		token = rewriter.create<gpu::DestroyDnTensorOp>(loc, tokenTp, token, dnC)
.getAsyncToken();		.getAsyncToken();
token = rewriter.create<gpu::DestroySparseEnvOp>(loc, tokenTp, token, handle)		token = rewriter
		.create<gpu::DestroySparseEnvOp>(loc, tokenTp, token,
		gpu::RtLibMode::CUSPARSE)
.getAsyncToken();		.getAsyncToken();
token = genDeallocMemRef(rewriter, loc, rowA, token);		token = genDeallocMemRef(rewriter, loc, rowA, token);
aartbik (inline comment, marked Not Done): This should not have been removed. Sending out a fix.
if (colA)		if (colA)
token = genDeallocMemRef(rewriter, loc, colA, token);		token = genDeallocMemRef(rewriter, loc, colA, token);
token = genDeallocMemRef(rewriter, loc, valA, token);		token = genDeallocMemRef(rewriter, loc, valA, token);
token = genDeallocMemRef(rewriter, loc, buffer, token);		token = genDeallocMemRef(rewriter, loc, buffer, token);
token = genDeallocMemRef(rewriter, loc, matB, token);		token = genDeallocMemRef(rewriter, loc, matB, token);
token = genCopyMemRef(rewriter, loc, bufC, matC, token);		token = genCopyMemRef(rewriter, loc, bufC, matC, token);
token = genDeallocMemRef(rewriter, loc, matC, token);		token = genDeallocMemRef(rewriter, loc, matC, token);
tokens.push_back(token);		tokens.push_back(token);
▲ Show 20 Lines • Show All 43 Lines • ▼ Show 20 Lines	static LogicalResult rewriteSDDMM(PatternRewriter &rewriter,
Value rowC = genAllocCopy(rewriter, loc, memR, tokens);		Value rowC = genAllocCopy(rewriter, loc, memR, tokens);
Value colC = memC ? genAllocCopy(rewriter, loc, memC, tokens) : Value();		Value colC = memC ? genAllocCopy(rewriter, loc, memC, tokens) : Value();
Value valC = genAllocCopy(rewriter, loc, memV, tokens);		Value valC = genAllocCopy(rewriter, loc, memV, tokens);
genBlockingWait(rewriter, loc, tokens);		genBlockingWait(rewriter, loc, tokens);
tokens.clear();		tokens.clear();

// Create sparse environment and sparse matrix/dense matrix handles.		// Create sparse environment and sparse matrix/dense matrix handles.
Type indexTp = rewriter.getIndexType();		Type indexTp = rewriter.getIndexType();
Type envHandleTp = rewriter.getType<gpu::SparseEnvHandleType>();
Type dnMatHandleTp = rewriter.getType<gpu::SparseDnTensorHandleType>();		Type dnMatHandleTp = rewriter.getType<gpu::SparseDnTensorHandleType>();
Type spMatHandleTp = rewriter.getType<gpu::SparseSpMatHandleType>();		Type spMatHandleTp = rewriter.getType<gpu::SparseSpMatHandleType>();
Type tokenTp = rewriter.getType<gpu::AsyncTokenType>();		Type tokenTp = rewriter.getType<gpu::AsyncTokenType>();
Value token = genFirstWait(rewriter, loc);		Value token = genFirstWait(rewriter, loc);
auto env =		auto env = rewriter.create<gpu::CreateSparseEnvOp>(loc, tokenTp, token,
rewriter.create<gpu::CreateSparseEnvOp>(loc, envHandleTp, tokenTp, token);		gpu::RtLibMode::CUSPARSE);
Value handle = env.getResult(0);
token = env.getAsyncToken();		token = env.getAsyncToken();

auto dmatA = rewriter.create<gpu::CreateDnTensorOp>(		auto dmatA = rewriter.create<gpu::CreateDnTensorOp>(
loc, dnMatHandleTp, tokenTp, token, handle, matA,		loc, dnMatHandleTp, tokenTp, token, matA, SmallVector<Value>{szm, szk});
SmallVector<Value>{szm, szk});
Value dnA = dmatA.getResult(0);		Value dnA = dmatA.getResult(0);
token = dmatA.getAsyncToken();		token = dmatA.getAsyncToken();
auto dmatB = rewriter.create<gpu::CreateDnTensorOp>(		auto dmatB = rewriter.create<gpu::CreateDnTensorOp>(
loc, dnMatHandleTp, tokenTp, token, handle, matB,		loc, dnMatHandleTp, tokenTp, token, matB, SmallVector<Value>{szk, szn});
SmallVector<Value>{szk, szn});
Value dnB = dmatB.getResult(0);		Value dnB = dmatB.getResult(0);
token = dmatB.getAsyncToken();		token = dmatB.getAsyncToken();

Operation *spGenC =		Operation *spGenC =
genSpMat(rewriter, loc, spMatHandleTp, tokenTp, token, szm, szn, nseC,		genSpMat(rewriter, loc, spMatHandleTp, tokenTp, token, szm, szn, nseC,
rowC, colC, valC, isCOO, enableRT);		rowC, colC, valC, isCOO, enableRT);
Value spMatC = spGenC->getResult(0);		Value spMatC = spGenC->getResult(0);
token = spGenC->getResult(1);		token = spGenC->getResult(1);

auto dnCType = llvm::cast<ShapedType>(c.getType()).getElementType();		auto dnCType = llvm::cast<ShapedType>(c.getType()).getElementType();
// Precompute buffersize for SDDMM.		// Precompute buffersize for SDDMM.
auto bufferComp = rewriter.create<gpu::SDDMMBufferSizeOp>(		auto bufferComp = rewriter.create<gpu::SDDMMBufferSizeOp>(
loc, indexTp, tokenTp, token, handle, dnA, dnB, spMatC, dnCType);		loc, indexTp, tokenTp, token, dnA, dnB, spMatC, dnCType);
Value bufferSz = bufferComp.getResult(0);		Value bufferSz = bufferComp.getResult(0);
token = bufferComp.getAsyncToken();		token = bufferComp.getAsyncToken();
auto buf = genAllocBuffer(rewriter, loc, bufferSz, token);		auto buf = genAllocBuffer(rewriter, loc, bufferSz, token);
Value buffer = buf.getResult(0);		Value buffer = buf.getResult(0);
token = buf.getAsyncToken();		token = buf.getAsyncToken();

// Perform the SDDMM.		// Perform the SDDMM.
auto sddmmComp = rewriter.create<gpu::SDDMMOp>(		auto sddmmComp = rewriter.create<gpu::SDDMMOp>(loc, tokenTp, token, dnA, dnB,
loc, tokenTp, token, handle, dnA, dnB, spMatC, dnCType, buffer);		spMatC, dnCType, buffer);
token = sddmmComp.getAsyncToken();		token = sddmmComp.getAsyncToken();

// Copy data back to host and free all the resources.		// Copy data back to host and free all the resources.
token = rewriter.create<gpu::DestroyDnTensorOp>(loc, tokenTp, token, dnA)		token = rewriter.create<gpu::DestroyDnTensorOp>(loc, tokenTp, token, dnA)
.getAsyncToken();		.getAsyncToken();
token = rewriter.create<gpu::DestroyDnTensorOp>(loc, tokenTp, token, dnB)		token = rewriter.create<gpu::DestroyDnTensorOp>(loc, tokenTp, token, dnB)
.getAsyncToken();		.getAsyncToken();
token = rewriter.create<gpu::DestroySpMatOp>(loc, tokenTp, token, spMatC)		token = rewriter.create<gpu::DestroySpMatOp>(loc, tokenTp, token, spMatC)
.getAsyncToken();		.getAsyncToken();
token = rewriter.create<gpu::DestroySparseEnvOp>(loc, tokenTp, token, handle)		token = rewriter
		.create<gpu::DestroySparseEnvOp>(loc, tokenTp, token,
		gpu::RtLibMode::CUSPARSE)
.getAsyncToken();		.getAsyncToken();
token = genDeallocMemRef(rewriter, loc, buffer, token);		token = genDeallocMemRef(rewriter, loc, buffer, token);
token = genDeallocMemRef(rewriter, loc, matA, token);		token = genDeallocMemRef(rewriter, loc, matA, token);
token = genDeallocMemRef(rewriter, loc, matB, token);		token = genDeallocMemRef(rewriter, loc, matB, token);
token = genDeallocMemRef(rewriter, loc, rowC, token);		token = genDeallocMemRef(rewriter, loc, rowC, token);
if (colC)		if (colC)
token = genDeallocMemRef(rewriter, loc, colC, token);		token = genDeallocMemRef(rewriter, loc, colC, token);
token = genCopyMemRef(rewriter, loc, memV, valC, token);		token = genCopyMemRef(rewriter, loc, memV, valC, token);

mlir/lib/ExecutionEngine/CudaRuntimeWrappers.cpp

Show First 20 Lines • Show All 73 Lines • ▼ Show 20 Lines	ScopedContext() {
}();		}();

CUDA_REPORT_IF_ERROR(cuCtxPushCurrent(context));		CUDA_REPORT_IF_ERROR(cuCtxPushCurrent(context));
}		}

~ScopedContext() { CUDA_REPORT_IF_ERROR(cuCtxPopCurrent(nullptr)); }		~ScopedContext() { CUDA_REPORT_IF_ERROR(cuCtxPopCurrent(nullptr)); }
};		};

		#ifdef MLIR_ENABLE_CUDA_CUSPARSE
		// Create the cusparse handles once for the duration of the instance
		aartbik: Remove all this scoped stuff and the initializer bool altogether in favor of (1) a single static handle and (2) create/destroy methods. Then you can simply have:
		// Single handle shared between all cuSparse calls. The client is responsible
		// for calling mgpuCreateSparseEnv() and mgpuDestroySparseEnv()
		// on entering and exiting the module containing sparsified GPU code.
		static cusparseHandle_t handle = nullptr;
		extern "C" MLIR_CUDA_WRAPPERS_EXPORT void mgpuCreateSparseEnv() {
		  assert(!handle);
		  CUSPARSE_REPORT_IF_ERROR(cusparseCreate(&handle))
		}
		extern "C" MLIR_CUDA_WRAPPERS_EXPORT void mgpuDestroySparseEnv() {
		  assert(handle);
		  CUSPARSE_REPORT_IF_ERROR(cusparseDestroy(handle))
		  handle = nullptr;
		}
		and then all methods have:
		extern "C" MLIR_CUDA_WRAPPERS_EXPORT void mgpuSpMV(int32_t ma, void *a, void *x, void *y, int32_t ctp, void *buf, CUstream /*stream*/) {
		  assert(handle && "client did not call mgpuCreateSparseEnv()");
		  ...
		}
		K-Wu (author): Addressed. Let me know your thoughts!
		class ScopedCuSparseHandleStorage {
		public:
		static cusparseHandle_t env;
		static bool initiated;
		ScopedCuSparseHandleStorage() {
		// Static reference to CUDA cuSparse environment handle
		if (!initiated) {
		CUSPARSE_REPORT_IF_ERROR(cusparseCreate(&env));
		initiated = true;
		}
		}

		~ScopedCuSparseHandleStorage() {}
		};

		cusparseHandle_t ScopedCuSparseHandleStorage::env = nullptr;
		bool ScopedCuSparseHandleStorage::initiated = false;

		#ifdef MLIR_ENABLE_CUDA_CUSPARSELT
		class ScopedCuSparseLtHandleStorage {
		public:
		static cusparseLtHandle_t env;
		static bool initiated;
		ScopedCuSparseLtHandleStorage() {
		// Static reference to CUDA cuSparseLt environment handle
		if (!initiated) {
		aartbik: This feels very thread-unsafe! I would expect something like a static initializer to take care of this. Right now, the setup is done on an executing thread, so having more than one thread gets into trouble.
		K-Wu (author): I moved the initialization into the create env functions. How does it look?
		initiated = true;
		// note that cuSparseLt still uses cusparseStatus_t
		CUSPARSE_REPORT_IF_ERROR(cusparseLtInit(&env));
		}
		}

		~ScopedCuSparseLtHandleStorage() {}
		};

		cusparseLtHandle_t ScopedCuSparseLtHandleStorage::env;
		bool ScopedCuSparseLtHandleStorage::initiated = false;

		#endif // MLIR_ENABLE_CUDA_CUSPARSELT
		#endif // MLIR_ENABLE_CUDA_CUSPARSE
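The thread-safety concern raised in the inline comments (the unsynchronized `initiated` check can race if several threads construct the storage object at the same time) can be illustrated without CUDA. The following is a minimal sketch, not the code that landed: `FakeHandle`, `fakeCreate`, `createCount`, and `getSharedHandle` are hypothetical names, with `FakeHandle` standing in for `cusparseHandle_t` and `fakeCreate` for `cusparseCreate`. It relies on the C++11 guarantee that a function-local static is initialized exactly once, even when the first calls are concurrent.

```cpp
#include <cassert>

// Hypothetical stand-in for an opaque library handle such as cusparseHandle_t.
struct FakeHandle {
  int id;
};

// Counts how many times the (fake) library handle was created.
static int createCount = 0;

// Hypothetical stand-in for cusparseCreate().
static FakeHandle *fakeCreate() {
  ++createCount;
  return new FakeHandle{42};
}

// Returns the process-wide handle. The initializer of a function-local static
// runs exactly once (C++11 and later), including under concurrent first use,
// so no separate "initiated" flag is needed. The handle is intentionally
// never freed in this sketch.
FakeHandle *getSharedHandle() {
  static FakeHandle *handle = fakeCreate();
  assert(handle && "handle creation failed");
  return handle;
}
```

Every caller of `getSharedHandle()` observes the same pointer, and `createCount` stays at 1 no matter how often it is called. This lazy form has no explicit destroy step, which is one reason the create/destroy design discussed in this review may still be preferable.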

extern "C" MLIR_CUDA_WRAPPERS_EXPORT CUmodule mgpuModuleLoad(void *data) {		extern "C" MLIR_CUDA_WRAPPERS_EXPORT CUmodule mgpuModuleLoad(void *data) {
ScopedContext scopedContext;		ScopedContext scopedContext;
CUmodule module = nullptr;		CUmodule module = nullptr;
CUDA_REPORT_IF_ERROR(cuModuleLoadData(&module, data));		CUDA_REPORT_IF_ERROR(cuModuleLoadData(&module, data));
return module;		return module;
}		}

extern "C" MLIR_CUDA_WRAPPERS_EXPORT void mgpuModuleUnload(CUmodule module) {		extern "C" MLIR_CUDA_WRAPPERS_EXPORT void mgpuModuleUnload(CUmodule module) {
▲ Show 20 Lines • Show All 175 Lines • ▼ Show 20 Lines	#define ALPHABETA(dtp, alpha, beta) \
} else if (dtp == CUDA_R_32F || dtp == CUDA_C_32F) { \		} else if (dtp == CUDA_R_32F || dtp == CUDA_C_32F) { \
(alpha##p) = reinterpret_cast<void *>(&(alpha##f)); \		(alpha##p) = reinterpret_cast<void *>(&(alpha##f)); \
(beta##p) = reinterpret_cast<void *>(&(beta##f)); \		(beta##p) = reinterpret_cast<void *>(&(beta##f)); \
} else { \		} else { \
(alpha##p) = reinterpret_cast<void *>(&(alpha##d)); \		(alpha##p) = reinterpret_cast<void *>(&(alpha##d)); \
(beta##p) = reinterpret_cast<void *>(&(beta##d)); \		(beta##p) = reinterpret_cast<void *>(&(beta##d)); \
}		}
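The visible tail of `ALPHABETA` picks type-erased pointers to the scaling constants based on the compute type. A standalone sketch of that selection follows. It is an illustration only: `FakeDataType`, `Scalars`, `ScalarPointers`, and `selectScalars` are hypothetical names, the scalar values of 1 are assumptions, and the branch of the macro that is hidden in this view is deliberately not reproduced.

```cpp
#include <cassert>

// Hypothetical stand-in for the cudaDataType_t values visible in the macro.
enum class FakeDataType { R32F, C32F, R64F };

// Typed storage for the scaling constants (values are assumptions).
struct Scalars {
  float alphaF = 1.0f;
  float betaF = 1.0f;
  double alphaD = 1.0;
  double betaD = 1.0;
};

// Type-erased pointers, as a C library API taking `const void *` expects.
struct ScalarPointers {
  void *alpha;
  void *beta;
};

// Mirrors the two branches shown in the diff: 32-bit float types use the
// float constants, everything else falls back to the double constants.
ScalarPointers selectScalars(FakeDataType dtp, Scalars &s) {
  if (dtp == FakeDataType::R32F || dtp == FakeDataType::C32F)
    return {&s.alphaF, &s.betaF};
  return {&s.alphaD, &s.betaD};
}
```

The point of the indirection is that the pointed-to constant must have the same width as the compute type the library is told to use; passing a `double` where a `float` is expected would be read incorrectly.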

extern "C" MLIR_CUDA_WRAPPERS_EXPORT void *		extern "C" MLIR_CUDA_WRAPPERS_EXPORT void
mgpuCreateSparseEnv(CUstream /*stream*/) {		mgpuCreateSparseEnv(CUstream /*stream*/) {
cusparseHandle_t handle = nullptr;		ScopedCuSparseHandleStorage hstorage;
CUSPARSE_REPORT_IF_ERROR(cusparseCreate(&handle))		return;
		aartbik: I think we should not have this create/destroy, but a simple startModule/endModule (with restrictions that they are called in certain ways) and then use the handle below.
		aartbik: Document that scoped context is for cinit.
return reinterpret_cast<void *>(handle);
}		}
		aartbik: I would do assert(!cusparse_env) so we detect double calls in debug mode.

extern "C" MLIR_CUDA_WRAPPERS_EXPORT void		extern "C" MLIR_CUDA_WRAPPERS_EXPORT void
mgpuDestroySparseEnv(void *h, CUstream /*stream*/) {		mgpuDestroySparseEnv(CUstream /*stream*/) {
cusparseHandle_t handle = reinterpret_cast<cusparseHandle_t>(h);		ScopedCuSparseHandleStorage hstorage;
CUSPARSE_REPORT_IF_ERROR(cusparseDestroy(handle))		CUSPARSE_REPORT_IF_ERROR(cusparseDestroy(hstorage.env))
		hstorage.initiated = false;
}		}
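The lifecycle suggested in the review comments (one static handle, explicit create/destroy entry points, and asserts that catch double creation or use before creation) can be sketched without CUDA. This is an illustration under stated assumptions, not the code in this revision: `FakeEnv`, `createSparseEnvSketch`, `destroySparseEnvSketch`, `sparseEnvIsLive`, and `useSparseEnvSketch` are hypothetical names, and `FakeEnv` stands in for `cusparseHandle_t`.

```cpp
#include <cassert>

// Hypothetical stand-in for the library environment behind cusparseHandle_t.
struct FakeEnv {
  int scale;
};

// Single environment shared by every wrapper call in this translation unit.
static FakeEnv *sparseEnv = nullptr;

// Called once when entering the module that contains sparse GPU code.
void createSparseEnvSketch() {
  assert(!sparseEnv && "environment created twice");
  sparseEnv = new FakeEnv{2};
}

// Called once when leaving that module.
void destroySparseEnvSketch() {
  assert(sparseEnv && "environment destroyed before creation");
  delete sparseEnv;
  sparseEnv = nullptr;
}

// Reports whether the environment currently exists.
bool sparseEnvIsLive() { return sparseEnv != nullptr; }

// Any operation that needs the environment checks for it instead of
// silently creating it on demand.
int useSparseEnvSketch(int x) {
  assert(sparseEnv && "client did not create the environment");
  return x * sparseEnv->scale;
}
```

Compared with constructing a storage object in every wrapper, this keeps creation and destruction at two well-defined points, so a destroy call can never create the handle as a side effect. The asserts only fire in debug builds, which matches the intent stated in the comment above.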

extern "C" MLIR_CUDA_WRAPPERS_EXPORT void *		extern "C" MLIR_CUDA_WRAPPERS_EXPORT void *
mgpuCreateDnVec(intptr_t size, void *values, int32_t dtp, CUstream /*stream*/) {		mgpuCreateDnVec(intptr_t size, void *values, int32_t dtp, CUstream /*stream*/) {
cusparseDnVecDescr_t vec = nullptr;		cusparseDnVecDescr_t vec = nullptr;
auto dTp = static_cast<cudaDataType_t>(dtp);		auto dTp = static_cast<cudaDataType_t>(dtp);
CUSPARSE_REPORT_IF_ERROR(cusparseCreateDnVec(&vec, size, values, dTp))		CUSPARSE_REPORT_IF_ERROR(cusparseCreateDnVec(&vec, size, values, dTp))
return reinterpret_cast<void *>(vec);		return reinterpret_cast<void *>(vec);
▲ Show 20 Lines • Show All 62 Lines • ▼ Show 20 Lines
}		}

extern "C" MLIR_CUDA_WRAPPERS_EXPORT void		extern "C" MLIR_CUDA_WRAPPERS_EXPORT void
mgpuDestroySpMat(void *m, CUstream /*stream*/) {		mgpuDestroySpMat(void *m, CUstream /*stream*/) {
cusparseSpMatDescr_t mat = reinterpret_cast<cusparseSpMatDescr_t>(m);		cusparseSpMatDescr_t mat = reinterpret_cast<cusparseSpMatDescr_t>(m);
CUSPARSE_REPORT_IF_ERROR(cusparseDestroySpMat(mat))		CUSPARSE_REPORT_IF_ERROR(cusparseDestroySpMat(mat))
}		}

extern "C" MLIR_CUDA_WRAPPERS_EXPORT intptr_t		extern "C" MLIR_CUDA_WRAPPERS_EXPORT intptr_t mgpuSpMVBufferSize(
mgpuSpMVBufferSize(void *h, int32_t ma, void *a, void *x, void *y, int32_t ctp,		int32_t ma, void *a, void *x, void *y, int32_t ctp, CUstream /*stream*/) {
CUstream /*stream*/) {		ScopedCuSparseHandleStorage hstorage;
cusparseHandle_t handle = reinterpret_cast<cusparseHandle_t>(h);
cusparseOperation_t modeA = static_cast<cusparseOperation_t>(ma);		cusparseOperation_t modeA = static_cast<cusparseOperation_t>(ma);
cusparseSpMatDescr_t matA = reinterpret_cast<cusparseSpMatDescr_t>(a);		cusparseSpMatDescr_t matA = reinterpret_cast<cusparseSpMatDescr_t>(a);
cusparseDnVecDescr_t vecX = reinterpret_cast<cusparseDnVecDescr_t>(x);		cusparseDnVecDescr_t vecX = reinterpret_cast<cusparseDnVecDescr_t>(x);
cusparseDnVecDescr_t vecY = reinterpret_cast<cusparseDnVecDescr_t>(y);		cusparseDnVecDescr_t vecY = reinterpret_cast<cusparseDnVecDescr_t>(y);
cudaDataType_t cTp = static_cast<cudaDataType_t>(ctp);		cudaDataType_t cTp = static_cast<cudaDataType_t>(ctp);
ALPHABETA(cTp, alpha, beta)		ALPHABETA(cTp, alpha, beta)
size_t bufferSize = 0;		size_t bufferSize = 0;
CUSPARSE_REPORT_IF_ERROR(		CUSPARSE_REPORT_IF_ERROR(cusparseSpMV_bufferSize(
cusparseSpMV_bufferSize(handle, modeA, alphap, matA, vecX, betap, vecY,		hstorage.env, modeA, alphap, matA, vecX, betap, vecY, cTp,
cTp, CUSPARSE_SPMV_ALG_DEFAULT, &bufferSize))		CUSPARSE_SPMV_ALG_DEFAULT, &bufferSize))
return bufferSize == 0 ? 1 : bufferSize; // avoid zero-alloc		return bufferSize == 0 ? 1 : bufferSize; // avoid zero-alloc
}		}

extern "C" MLIR_CUDA_WRAPPERS_EXPORT void mgpuSpMV(void *h, int32_t ma, void *a,		extern "C" MLIR_CUDA_WRAPPERS_EXPORT void mgpuSpMV(int32_t ma, void *a, void *x,
void *x, void *y,		void *y, int32_t ctp,
int32_t ctp, void *buf,		void *buf,
CUstream /*stream*/) {		CUstream /*stream*/) {
cusparseHandle_t handle = reinterpret_cast<cusparseHandle_t>(h);
		ScopedCuSparseHandleStorage hstorage;
cusparseOperation_t modeA = static_cast<cusparseOperation_t>(ma);		cusparseOperation_t modeA = static_cast<cusparseOperation_t>(ma);
cusparseSpMatDescr_t matA = reinterpret_cast<cusparseSpMatDescr_t>(a);		cusparseSpMatDescr_t matA = reinterpret_cast<cusparseSpMatDescr_t>(a);
cusparseDnVecDescr_t vecX = reinterpret_cast<cusparseDnVecDescr_t>(x);		cusparseDnVecDescr_t vecX = reinterpret_cast<cusparseDnVecDescr_t>(x);
cusparseDnVecDescr_t vecY = reinterpret_cast<cusparseDnVecDescr_t>(y);		cusparseDnVecDescr_t vecY = reinterpret_cast<cusparseDnVecDescr_t>(y);
cudaDataType_t cTp = static_cast<cudaDataType_t>(ctp);		cudaDataType_t cTp = static_cast<cudaDataType_t>(ctp);
ALPHABETA(cTp, alpha, beta)		ALPHABETA(cTp, alpha, beta)
CUSPARSE_REPORT_IF_ERROR(cusparseSpMV(handle, modeA, alphap, matA, vecX,		CUSPARSE_REPORT_IF_ERROR(cusparseSpMV(hstorage.env, modeA, alphap, matA, vecX,
betap, vecY, cTp,		betap, vecY, cTp,
CUSPARSE_SPMV_ALG_DEFAULT, buf))		CUSPARSE_SPMV_ALG_DEFAULT, buf))
}		}

extern "C" MLIR_CUDA_WRAPPERS_EXPORT intptr_t		extern "C" MLIR_CUDA_WRAPPERS_EXPORT intptr_t
mgpuSpMMBufferSize(void *h, int32_t ma, int32_t mb, void *a, void *b, void *c,		mgpuSpMMBufferSize(int32_t ma, int32_t mb, void *a, void *b, void *c,
int32_t ctp, CUstream /*stream*/) {		int32_t ctp, CUstream /*stream*/) {
cusparseHandle_t handle = reinterpret_cast<cusparseHandle_t>(h);		ScopedCuSparseHandleStorage hstorage;
cusparseOperation_t modeA = static_cast<cusparseOperation_t>(ma);		cusparseOperation_t modeA = static_cast<cusparseOperation_t>(ma);
cusparseOperation_t modeB = static_cast<cusparseOperation_t>(mb);		cusparseOperation_t modeB = static_cast<cusparseOperation_t>(mb);
cusparseSpMatDescr_t matA = reinterpret_cast<cusparseSpMatDescr_t>(a);		cusparseSpMatDescr_t matA = reinterpret_cast<cusparseSpMatDescr_t>(a);
cusparseDnMatDescr_t matB = reinterpret_cast<cusparseDnMatDescr_t>(b);		cusparseDnMatDescr_t matB = reinterpret_cast<cusparseDnMatDescr_t>(b);
cusparseDnMatDescr_t matC = reinterpret_cast<cusparseDnMatDescr_t>(c);		cusparseDnMatDescr_t matC = reinterpret_cast<cusparseDnMatDescr_t>(c);
cudaDataType_t cTp = static_cast<cudaDataType_t>(ctp);		cudaDataType_t cTp = static_cast<cudaDataType_t>(ctp);
ALPHABETA(cTp, alpha, beta)		ALPHABETA(cTp, alpha, beta)
size_t bufferSize = 0;		size_t bufferSize = 0;
CUSPARSE_REPORT_IF_ERROR(cusparseSpMM_bufferSize(		CUSPARSE_REPORT_IF_ERROR(cusparseSpMM_bufferSize(
handle, modeA, modeB, alphap, matA, matB, betap, matC, cTp,		hstorage.env, modeA, modeB, alphap, matA, matB, betap, matC, cTp,
CUSPARSE_SPMM_ALG_DEFAULT, &bufferSize))		CUSPARSE_SPMM_ALG_DEFAULT, &bufferSize))
return bufferSize == 0 ? 1 : bufferSize; // avoid zero-alloc		return bufferSize == 0 ? 1 : bufferSize; // avoid zero-alloc
}		}

extern "C" MLIR_CUDA_WRAPPERS_EXPORT void		extern "C" MLIR_CUDA_WRAPPERS_EXPORT void mgpuSpMM(int32_t ma, int32_t mb,
mgpuSpMM(void *h, int32_t ma, int32_t mb, void *a, void *b, void *c,		void *a, void *b, void *c,
int32_t ctp, void *buf, CUstream /*stream*/) {		int32_t ctp, void *buf,
cusparseHandle_t handle = reinterpret_cast<cusparseHandle_t>(h);		CUstream /*stream*/) {
		ScopedCuSparseHandleStorage hstorage;
cusparseOperation_t modeA = static_cast<cusparseOperation_t>(ma);		cusparseOperation_t modeA = static_cast<cusparseOperation_t>(ma);
cusparseOperation_t modeB = static_cast<cusparseOperation_t>(mb);		cusparseOperation_t modeB = static_cast<cusparseOperation_t>(mb);
cusparseSpMatDescr_t matA = reinterpret_cast<cusparseSpMatDescr_t>(a);		cusparseSpMatDescr_t matA = reinterpret_cast<cusparseSpMatDescr_t>(a);
cusparseDnMatDescr_t matB = reinterpret_cast<cusparseDnMatDescr_t>(b);		cusparseDnMatDescr_t matB = reinterpret_cast<cusparseDnMatDescr_t>(b);
cusparseDnMatDescr_t matC = reinterpret_cast<cusparseDnMatDescr_t>(c);		cusparseDnMatDescr_t matC = reinterpret_cast<cusparseDnMatDescr_t>(c);
cudaDataType_t cTp = static_cast<cudaDataType_t>(ctp);		cudaDataType_t cTp = static_cast<cudaDataType_t>(ctp);
ALPHABETA(cTp, alpha, beta)		ALPHABETA(cTp, alpha, beta)
CUSPARSE_REPORT_IF_ERROR(cusparseSpMM(handle, modeA, modeB, alphap, matA,		CUSPARSE_REPORT_IF_ERROR(cusparseSpMM(hstorage.env, modeA, modeB, alphap,
matB, betap, matC, cTp,		matA, matB, betap, matC, cTp,
CUSPARSE_SPMM_ALG_DEFAULT, buf))		CUSPARSE_SPMM_ALG_DEFAULT, buf))
}		}

// TODO: add support to passing alpha and beta as arguments		// TODO: add support to passing alpha and beta as arguments
extern "C" MLIR_CUDA_WRAPPERS_EXPORT intptr_t		extern "C" MLIR_CUDA_WRAPPERS_EXPORT intptr_t
mgpuSDDMMBufferSize(void *h, int32_t ma, int32_t mb, void *a, void *b, void *c,		mgpuSDDMMBufferSize(int32_t ma, int32_t mb, void *a, void *b, void *c,
int32_t ctp, CUstream /*stream*/) {		int32_t ctp, CUstream /*stream*/) {
cusparseHandle_t handle = reinterpret_cast<cusparseHandle_t>(h);		ScopedCuSparseHandleStorage hstorage;
cusparseOperation_t modeA = static_cast<cusparseOperation_t>(ma);		cusparseOperation_t modeA = static_cast<cusparseOperation_t>(ma);
cusparseOperation_t modeB = static_cast<cusparseOperation_t>(mb);		cusparseOperation_t modeB = static_cast<cusparseOperation_t>(mb);
cusparseDnMatDescr_t matA = reinterpret_cast<cusparseDnMatDescr_t>(a);		cusparseDnMatDescr_t matA = reinterpret_cast<cusparseDnMatDescr_t>(a);
cusparseDnMatDescr_t matB = reinterpret_cast<cusparseDnMatDescr_t>(b);		cusparseDnMatDescr_t matB = reinterpret_cast<cusparseDnMatDescr_t>(b);
cusparseSpMatDescr_t matC = reinterpret_cast<cusparseSpMatDescr_t>(c);		cusparseSpMatDescr_t matC = reinterpret_cast<cusparseSpMatDescr_t>(c);
auto cTp = static_cast<cudaDataType_t>(ctp);		auto cTp = static_cast<cudaDataType_t>(ctp);
ALPHABETA(cTp, alpha, beta)		ALPHABETA(cTp, alpha, beta)
size_t bufferSize = 0;		size_t bufferSize = 0;
CUSPARSE_REPORT_IF_ERROR(cusparseSDDMM_bufferSize(		CUSPARSE_REPORT_IF_ERROR(cusparseSDDMM_bufferSize(
handle, modeA, modeB, alphap, matA, matB, betap, matC, cTp,		hstorage.env, modeA, modeB, alphap, matA, matB, betap, matC, cTp,
CUSPARSE_SDDMM_ALG_DEFAULT, &bufferSize))		CUSPARSE_SDDMM_ALG_DEFAULT, &bufferSize))
return bufferSize == 0 ? 1 : bufferSize; // avoid zero-alloc		return bufferSize == 0 ? 1 : bufferSize; // avoid zero-alloc
}		}

extern "C" MLIR_CUDA_WRAPPERS_EXPORT void		extern "C" MLIR_CUDA_WRAPPERS_EXPORT void mgpuSDDMM(int32_t ma, int32_t mb,
mgpuSDDMM(void *h, int32_t ma, int32_t mb, void *a, void *b, void *c,		void *a, void *b, void *c,
int32_t ctp, void *buf, CUstream /*stream*/) {		int32_t ctp, void *buf,
cusparseHandle_t handle = reinterpret_cast<cusparseHandle_t>(h);		CUstream /*stream*/) {
		ScopedCuSparseHandleStorage hstorage;
cusparseOperation_t modeA = static_cast<cusparseOperation_t>(ma);		cusparseOperation_t modeA = static_cast<cusparseOperation_t>(ma);
cusparseOperation_t modeB = static_cast<cusparseOperation_t>(mb);		cusparseOperation_t modeB = static_cast<cusparseOperation_t>(mb);
cusparseDnMatDescr_t matA = reinterpret_cast<cusparseDnMatDescr_t>(a);		cusparseDnMatDescr_t matA = reinterpret_cast<cusparseDnMatDescr_t>(a);
cusparseDnMatDescr_t matB = reinterpret_cast<cusparseDnMatDescr_t>(b);		cusparseDnMatDescr_t matB = reinterpret_cast<cusparseDnMatDescr_t>(b);
cusparseSpMatDescr_t matC = reinterpret_cast<cusparseSpMatDescr_t>(c);		cusparseSpMatDescr_t matC = reinterpret_cast<cusparseSpMatDescr_t>(c);
auto cTp = static_cast<cudaDataType_t>(ctp);		auto cTp = static_cast<cudaDataType_t>(ctp);
ALPHABETA(cTp, alpha, beta)		ALPHABETA(cTp, alpha, beta)
CUSPARSE_REPORT_IF_ERROR(cusparseSDDMM(handle, modeA, modeB, alphap, matA,		CUSPARSE_REPORT_IF_ERROR(cusparseSDDMM(hstorage.env, modeA, modeB, alphap,
matB, betap, matC, cTp,		matA, matB, betap, matC, cTp,
CUSPARSE_SDDMM_ALG_DEFAULT, buf))		CUSPARSE_SDDMM_ALG_DEFAULT, buf))
}		}

#ifdef MLIR_ENABLE_CUDA_CUSPARSELT		#ifdef MLIR_ENABLE_CUDA_CUSPARSELT

///		///
/// Wrapper methods for the cuSparseLt library.		/// Wrapper methods for the cuSparseLt library.
///		///
Show All 14 Lines	struct cusparseLtDnMatHandleAndData {
void *values{nullptr};		void *values{nullptr};
};		};

static_assert(sizeof(cusparseLtHandle_t) == 11024);		static_assert(sizeof(cusparseLtHandle_t) == 11024);
static_assert(sizeof(cusparseLtSpMatHandleAndData) == 44104);		static_assert(sizeof(cusparseLtSpMatHandleAndData) == 44104);
static_assert(sizeof(cusparseLtDnMatHandleAndData) == 11032);		static_assert(sizeof(cusparseLtDnMatHandleAndData) == 11032);

extern "C" MLIR_CUDA_WRAPPERS_EXPORT void		extern "C" MLIR_CUDA_WRAPPERS_EXPORT void
mgpuCreateSparseLtEnv(void *h, CUstream /*stream*/) {		mgpuCreateSparseLtEnv(CUstream /*stream*/) {
// note that cuSparseLt still uses cusparseStatus_t		ScopedCuSparseLtHandleStorage hstorage;
CUSPARSE_REPORT_IF_ERROR(
cusparseLtInit(reinterpret_cast<cusparseLtHandle_t *>(h)))
}		}
		aartbik: assert

extern "C" MLIR_CUDA_WRAPPERS_EXPORT void		extern "C" MLIR_CUDA_WRAPPERS_EXPORT void
mgpuDestroySparseLtEnv(void *h, CUstream /*stream*/) {		mgpuDestroySparseLtEnv(CUstream /*stream*/) {
auto handle = reinterpret_cast<cusparseLtHandle_t *>(h);		ScopedCuSparseLtHandleStorage hstorage;
CUSPARSE_REPORT_IF_ERROR(cusparseLtDestroy(handle))		CUSPARSE_REPORT_IF_ERROR(cusparseLtDestroy(&(hstorage.env)))
}		}

extern "C" MLIR_CUDA_WRAPPERS_EXPORT void		extern "C" MLIR_CUDA_WRAPPERS_EXPORT void
mgpuCreateCuSparseLtDnMat(void *dh, void *h, intptr_t rows, intptr_t cols,		mgpuCreateCuSparseLtDnMat(void *dh, intptr_t rows, intptr_t cols, void *values,
void *values, int32_t dtp, CUstream /*stream*/) {		int32_t dtp, CUstream /*stream*/) {
auto handle = reinterpret_cast<cusparseLtHandle_t *>(h);		ScopedCuSparseLtHandleStorage hstorage;

// CusparseLt expects the descriptors to be zero-initialized.		// CusparseLt expects the descriptors to be zero-initialized.
memset(dh, 0, sizeof(cusparseLtDnMatHandleAndData));		memset(dh, 0, sizeof(cusparseLtDnMatHandleAndData));
auto dnmat_handle = reinterpret_cast<cusparseLtDnMatHandleAndData *>(dh);		auto dnmat_handle = reinterpret_cast<cusparseLtDnMatHandleAndData *>(dh);
auto dTp = static_cast<cudaDataType_t>(dtp);		auto dTp = static_cast<cudaDataType_t>(dtp);
// assuming row-major when deciding lda		// assuming row-major when deciding lda
CUSPARSE_REPORT_IF_ERROR(cusparseLtDenseDescriptorInit(		CUSPARSE_REPORT_IF_ERROR(cusparseLtDenseDescriptorInit(
handle, &(dnmat_handle->mat), rows, cols, /*lda=*/cols,		&(hstorage.env), &(dnmat_handle->mat), rows, cols, /*lda=*/cols,
/*alignment=*/16, dTp, CUSPARSE_ORDER_ROW))		/*alignment=*/16, dTp, CUSPARSE_ORDER_ROW))
dnmat_handle->values = values;		dnmat_handle->values = values;
}		}

// This can be used to destroy both dense matrices and sparse matrices in		// This can be used to destroy both dense matrices and sparse matrices in
// cusparseLt		// cusparseLt
extern "C" MLIR_CUDA_WRAPPERS_EXPORT void		extern "C" MLIR_CUDA_WRAPPERS_EXPORT void
mgpuDestroyCuSparseLtSpMat(void *m, CUstream /*stream*/) {		mgpuDestroyCuSparseLtSpMat(void *m, CUstream /*stream*/) {
auto matAndData = reinterpret_cast<cusparseLtSpMatHandleAndData *>(m);		auto matAndData = reinterpret_cast<cusparseLtSpMatHandleAndData *>(m);
CUSPARSE_REPORT_IF_ERROR(cusparseLtMatDescriptorDestroy(&(matAndData->mat)))		CUSPARSE_REPORT_IF_ERROR(cusparseLtMatDescriptorDestroy(&(matAndData->mat)))
}		}

extern "C" MLIR_CUDA_WRAPPERS_EXPORT void		extern "C" MLIR_CUDA_WRAPPERS_EXPORT void
mgpuDestroyCuSparseLtDnMat(void *m, CUstream /*stream*/) {		mgpuDestroyCuSparseLtDnMat(void *m, CUstream /*stream*/) {
auto matAndData = reinterpret_cast<cusparseLtDnMatHandleAndData *>(m);		auto matAndData = reinterpret_cast<cusparseLtDnMatHandleAndData *>(m);
CUSPARSE_REPORT_IF_ERROR(cusparseLtMatDescriptorDestroy(&(matAndData->mat)))		CUSPARSE_REPORT_IF_ERROR(cusparseLtMatDescriptorDestroy(&(matAndData->mat)))
}		}

extern "C" MLIR_CUDA_WRAPPERS_EXPORT void		extern "C" MLIR_CUDA_WRAPPERS_EXPORT void
mgpuCusparseLtCreate2To4SpMat(void *sh, void *h, intptr_t rows, intptr_t cols,		mgpuCusparseLtCreate2To4SpMat(void *sh, intptr_t rows, intptr_t cols,
void *values, int32_t dtp, CUstream /*stream*/) {		void *values, int32_t dtp, CUstream /*stream*/) {
auto spmat_handle = reinterpret_cast<cusparseLtSpMatHandleAndData *>(sh);		auto spmat_handle = reinterpret_cast<cusparseLtSpMatHandleAndData *>(sh);
// CusparseLt expects the descriptors to be zero-initialized.		// CusparseLt expects the descriptors to be zero-initialized.
memset(spmat_handle, 0, sizeof(cusparseLtSpMatHandleAndData));		memset(spmat_handle, 0, sizeof(cusparseLtSpMatHandleAndData));
spmat_handle->values = values;		spmat_handle->values = values;
auto handle = reinterpret_cast<cusparseLtHandle_t *>(h);		ScopedCuSparseLtHandleStorage hstorage;
auto dTp = static_cast<cudaDataType_t>(dtp);		auto dTp = static_cast<cudaDataType_t>(dtp);
// assuming row-major when deciding lda		// assuming row-major when deciding lda
CUSPARSE_REPORT_IF_ERROR(cusparseLtStructuredDescriptorInit(		CUSPARSE_REPORT_IF_ERROR(cusparseLtStructuredDescriptorInit(
handle, &(spmat_handle->mat), rows, cols, /*ld=*/cols, /*alignment=*/16,		&(hstorage.env), &(spmat_handle->mat), rows, cols, /*ld=*/cols,
dTp, CUSPARSE_ORDER_ROW, CUSPARSELT_SPARSITY_50_PERCENT))		/*alignment=*/16, dTp, CUSPARSE_ORDER_ROW,
		CUSPARSELT_SPARSITY_50_PERCENT))
}		}

// Several things are being done in this stage, algorithm selection, planning,		// Several things are being done in this stage, algorithm selection, planning,
// and returning workspace and compressed matrices data buffer sizes.		// and returning workspace and compressed matrices data buffer sizes.
extern "C" MLIR_CUDA_WRAPPERS_EXPORT void		extern "C" MLIR_CUDA_WRAPPERS_EXPORT void
mgpuCuSparseLtSpMMBufferSize(void *bs, void *h, int32_t ma, int32_t mb, void *a,		mgpuCuSparseLtSpMMBufferSize(void *bs, int32_t ma, int32_t mb, void *a, void *b,
void *b, void *c, int32_t ctp,		void *c, int32_t ctp, CUstream /*stream*/) {
CUstream /*stream*/) {
// TODO: support more advanced settings, e.g., the input right operand is a		// TODO: support more advanced settings, e.g., the input right operand is a
// sparse matrix assuming matA is the sparse matrix		// sparse matrix assuming matA is the sparse matrix
auto handle = reinterpret_cast<cusparseLtHandle_t *>(h);		ScopedCuSparseLtHandleStorage hstorage;
auto matA = reinterpret_cast<cusparseLtSpMatHandleAndData *>(a);		auto matA = reinterpret_cast<cusparseLtSpMatHandleAndData *>(a);
auto matB = reinterpret_cast<cusparseLtDnMatHandleAndData *>(b);		auto matB = reinterpret_cast<cusparseLtDnMatHandleAndData *>(b);
auto matC = reinterpret_cast<cusparseLtDnMatHandleAndData *>(c);		auto matC = reinterpret_cast<cusparseLtDnMatHandleAndData *>(c);
auto workspace_size = reinterpret_cast<int64_t *>(bs);		auto workspace_size = reinterpret_cast<int64_t *>(bs);
auto compressed_size = &(reinterpret_cast<int64_t *>(bs)[1]);		auto compressed_size = &(reinterpret_cast<int64_t *>(bs)[1]);
auto compressed_buffer_size = &(reinterpret_cast<int64_t *>(bs)[2]);		auto compressed_buffer_size = &(reinterpret_cast<int64_t *>(bs)[2]);
size_t workspace_size_, compressed_size_, compressed_buffer_size_;		size_t workspace_size_, compressed_size_, compressed_buffer_size_;
auto cTp = static_cast<cusparseComputeType>(ctp);		auto cTp = static_cast<cusparseComputeType>(ctp);

cusparseOperation_t modeA = static_cast<cusparseOperation_t>(ma);		cusparseOperation_t modeA = static_cast<cusparseOperation_t>(ma);
cusparseOperation_t modeB = static_cast<cusparseOperation_t>(mb);		cusparseOperation_t modeB = static_cast<cusparseOperation_t>(mb);
CUSPARSE_REPORT_IF_ERROR(cusparseLtMatmulDescriptorInit(		CUSPARSE_REPORT_IF_ERROR(cusparseLtMatmulDescriptorInit(
handle, &(matA->matmul), modeA, modeB, &(matA->mat), &(matB->mat),		&(hstorage.env), &(matA->matmul), modeA, modeB, &(matA->mat),
&(matC->mat), &(matC->mat), cTp))		&(matB->mat), &(matC->mat), &(matC->mat), cTp))
CUSPARSE_REPORT_IF_ERROR(cusparseLtMatmulAlgSelectionInit(		CUSPARSE_REPORT_IF_ERROR(cusparseLtMatmulAlgSelectionInit(
handle, &(matA->alg_sel), &(matA->matmul), CUSPARSELT_MATMUL_ALG_DEFAULT))		&(hstorage.env), &(matA->alg_sel), &(matA->matmul),
		CUSPARSELT_MATMUL_ALG_DEFAULT))
int alg = 0;		int alg = 0;
CUSPARSE_REPORT_IF_ERROR(cusparseLtMatmulAlgSetAttribute(		CUSPARSE_REPORT_IF_ERROR(cusparseLtMatmulAlgSetAttribute(
handle, &(matA->alg_sel), CUSPARSELT_MATMUL_ALG_CONFIG_ID, &alg,		&(hstorage.env), &(matA->alg_sel), CUSPARSELT_MATMUL_ALG_CONFIG_ID, &alg,
sizeof(alg)))		sizeof(alg)))

CUSPARSE_REPORT_IF_ERROR(cusparseLtMatmulPlanInit(		CUSPARSE_REPORT_IF_ERROR(cusparseLtMatmulPlanInit(
handle, &(matA->plan), &(matA->matmul), &(matA->alg_sel)))		&(hstorage.env), &(matA->plan), &(matA->matmul), &(matA->alg_sel)))

CUSPARSE_REPORT_IF_ERROR(		CUSPARSE_REPORT_IF_ERROR(cusparseLtMatmulGetWorkspace(
cusparseLtMatmulGetWorkspace(handle, &(matA->plan), &workspace_size_))		&(hstorage.env), &(matA->plan), &workspace_size_))
CUSPARSE_REPORT_IF_ERROR(cusparseLtSpMMACompressedSize(		CUSPARSE_REPORT_IF_ERROR(cusparseLtSpMMACompressedSize(
handle, &(matA->plan), &compressed_size_, &compressed_buffer_size_))		&(hstorage.env), &(matA->plan), &compressed_size_,
		&compressed_buffer_size_))

// avoid zero-alloc		// avoid zero-alloc
*workspace_size = (workspace_size_ == 0 ? 1 : workspace_size_);		*workspace_size = (workspace_size_ == 0 ? 1 : workspace_size_);
*compressed_size = (compressed_size_ == 0 ? 1 : compressed_size_);		*compressed_size = (compressed_size_ == 0 ? 1 : compressed_size_);
*compressed_buffer_size =		*compressed_buffer_size =
(compressed_buffer_size_ == 0 ? 1 : compressed_buffer_size_);		(compressed_buffer_size_ == 0 ? 1 : compressed_buffer_size_);
}		}
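The function above returns three sizes through one caller-provided buffer, treated as three consecutive `int64_t` slots, and clamps each to at least 1 so that later allocations are never zero-sized. A CUDA-free sketch of just that packing step follows; `atLeastOne` and `packBufferSizes` are hypothetical helper names introduced for illustration and do not exist in the runtime wrappers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Clamps a byte count to at least 1 so a later allocation is never empty.
static int64_t atLeastOne(size_t n) {
  return n == 0 ? 1 : static_cast<int64_t>(n);
}

// Writes the three sizes into `bs`, which must point to at least three
// consecutive int64_t slots: [workspace, compressed, compressed buffer].
void packBufferSizes(void *bs, size_t workspace, size_t compressed,
                     size_t compressedBuffer) {
  assert(bs && "caller must provide the output buffer");
  auto *out = static_cast<int64_t *>(bs);
  out[0] = atLeastOne(workspace);
  out[1] = atLeastOne(compressed);
  out[2] = atLeastOne(compressedBuffer);
}
```

Passing a stack array such as `int64_t sizes[3]` and reading the slots back in the same order is the intended usage. Keeping the slot order in one place matters because the caller indexes the buffer by position.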

extern "C" MLIR_CUDA_WRAPPERS_EXPORT void		extern "C" MLIR_CUDA_WRAPPERS_EXPORT void
mgpuCuSparseLtSpMM(void *h, void *a, void *b, void *c, void *d_workspace,		mgpuCuSparseLtSpMM(void *a, void *b, void *c, void *d_workspace,
void *dA_compressed, void *dA_compressedBuffer,		void *dA_compressed, void *dA_compressedBuffer,
CUstream stream) {		CUstream stream) {
auto handle = reinterpret_cast<cusparseLtHandle_t *>(h);		ScopedCuSparseLtHandleStorage hstorage;
auto matA = reinterpret_cast<cusparseLtSpMatHandleAndData *>(a);		auto matA = reinterpret_cast<cusparseLtSpMatHandleAndData *>(a);
auto matB = reinterpret_cast<cusparseLtDnMatHandleAndData *>(b);		auto matB = reinterpret_cast<cusparseLtDnMatHandleAndData *>(b);
auto matC = reinterpret_cast<cusparseLtDnMatHandleAndData *>(c);		auto matC = reinterpret_cast<cusparseLtDnMatHandleAndData *>(c);

ALPHABETA(CUDA_R_32F, alpha, beta)		ALPHABETA(CUDA_R_32F, alpha, beta)
CUSPARSE_REPORT_IF_ERROR(		CUSPARSE_REPORT_IF_ERROR(
cusparseLtSpMMACompress(handle, &(matA->plan), (matA->values),		cusparseLtSpMMACompress(&(hstorage.env), &(matA->plan), (matA->values),
dA_compressed, dA_compressedBuffer, stream))		dA_compressed, dA_compressedBuffer, stream))

// TODO: add support to multi-stream execution		// TODO: add support to multi-stream execution
// Perform the matrix multiplication. D = A*B+C using C==D for now		// Perform the matrix multiplication. D = A*B+C using C==D for now
CUSPARSE_REPORT_IF_ERROR(		CUSPARSE_REPORT_IF_ERROR(
cusparseLtMatmul(handle, &(matA->plan), alphap, dA_compressed,		cusparseLtMatmul(&(hstorage.env), &(matA->plan), alphap, dA_compressed,
matB->values, betap, matC->values,		matB->values, betap, matC->values,
/*dD*/ matC->values, d_workspace, nullptr, 0))		/*dD*/ matC->values, d_workspace, nullptr, 0))

CUSPARSE_REPORT_IF_ERROR(cusparseLtMatDescriptorDestroy(&(matA->mat)))		CUSPARSE_REPORT_IF_ERROR(cusparseLtMatDescriptorDestroy(&(matA->mat)))
// destroy the plan associated with the sparse matrix		// destroy the plan associated with the sparse matrix
CUSPARSE_REPORT_IF_ERROR(cusparseLtMatmulPlanDestroy(&(matA->plan)))		CUSPARSE_REPORT_IF_ERROR(cusparseLtMatmulPlanDestroy(&(matA->plan)))
}		}

#endif // MLIR_ENABLE_CUDA_CUSPARSELT		#endif // MLIR_ENABLE_CUDA_CUSPARSELT
#endif // MLIR_ENABLE_CUDA_CUSPARSE		#endif // MLIR_ENABLE_CUDA_CUSPARSE

mlir/test/Conversion/GPUCommon/lower-2to4-sparse-to-gpu-runtime-calls.mlir

Show All 14 Lines	module attributes {gpu.container_module} {
// CHECK: llvm.call @mgpuDestroyCuSparseLtDnMat		// CHECK: llvm.call @mgpuDestroyCuSparseLtDnMat
// CHECK: llvm.call @mgpuDestroySparseLtEnv		// CHECK: llvm.call @mgpuDestroySparseLtEnv
// CHECK: llvm.call @mgpuStreamSynchronize		// CHECK: llvm.call @mgpuStreamSynchronize
// CHECK: llvm.call @mgpuStreamDestroy		// CHECK: llvm.call @mgpuStreamDestroy
func.func @matmul(%arg0: index) {		func.func @matmul(%arg0: index) {
%token0 = gpu.wait async		%token0 = gpu.wait async
%mem1, %token1 = gpu.alloc async [%token0] (%arg0) : memref<?xf16>		%mem1, %token1 = gpu.alloc async [%token0] (%arg0) : memref<?xf16>
%mem2, %token2 = gpu.alloc async [%token1] (%arg0) : memref<?xf16>		%mem2, %token2 = gpu.alloc async [%token1] (%arg0) : memref<?xf16>
%env, %token3 = gpu.create_sparse_env async [%token2]		%token3 = gpu.create_sparse_env async [%token2] CUSPARSE_AND_CUSPARSE_LT
%spmat, %token4 = gpu.create_2to4_spmat async [%token3] %env, %arg0, %arg0, %mem1: memref<?xf16>		%spmat, %token4 = gpu.create_2to4_spmat async [%token3] %arg0, %arg0, %mem1: memref<?xf16>
%dnmat, %token5 = gpu.create_dn_tensor async [%token4] %env, %mem2, %arg0, %arg0 : index, index into memref<?xf16>		%dnmat, %token5 = gpu.create_dn_tensor async [%token4] %mem2, %arg0, %arg0 : index, index into memref<?xf16>
%bufferSz0, %bufferSz1, %bufferSz2, %token6 = gpu.spmm_buffer_size async [%token5] %env, %spmat, %dnmat, %dnmat : index,index,index into f16		%bufferSz0, %bufferSz1, %bufferSz2, %token6 = gpu.spmm_buffer_size async [%token5] %spmat, %dnmat, %dnmat : index,index,index into f16
%token7 = gpu.spmm async [%token6] %env, %spmat, %dnmat, %dnmat, %mem2, %mem2, %mem2 : memref<?xf16>,memref<?xf16>,memref<?xf16> into f16		%token7 = gpu.spmm async [%token6] %spmat, %dnmat, %dnmat, %mem2, %mem2, %mem2 : memref<?xf16>,memref<?xf16>,memref<?xf16> into f16
		aartbik: I would completely remove the calls here.
%token8 = gpu.destroy_sp_mat async [%token7] %spmat		%token8 = gpu.destroy_sp_mat async [%token7] %spmat
%token9 = gpu.destroy_dn_tensor async [%token8] %dnmat		%token9 = gpu.destroy_dn_tensor async [%token8] %dnmat
%token10 = gpu.destroy_sparse_env async [%token9] %env		%token10 = gpu.destroy_sparse_env async [%token9] CUSPARSE_AND_CUSPARSE_LT
gpu.wait [%token10]		gpu.wait [%token10]
return		return
}		}

}		}

mlir/test/Conversion/GPUCommon/lower-sparse-to-gpu-runtime-calls.mlir

Show All 14 Lines	module attributes {gpu.container_module} {
// CHECK: llvm.call @mgpuDestroyDnVec		// CHECK: llvm.call @mgpuDestroyDnVec
// CHECK: llvm.call @mgpuDestroySparseEnv		// CHECK: llvm.call @mgpuDestroySparseEnv
// CHECK: llvm.call @mgpuStreamSynchronize		// CHECK: llvm.call @mgpuStreamSynchronize
// CHECK: llvm.call @mgpuStreamDestroy		// CHECK: llvm.call @mgpuStreamDestroy
func.func @matvec(%arg0: index) {		func.func @matvec(%arg0: index) {
%token0 = gpu.wait async		%token0 = gpu.wait async
%mem1, %token1 = gpu.alloc async [%token0] (%arg0) : memref<?xindex>		%mem1, %token1 = gpu.alloc async [%token0] (%arg0) : memref<?xindex>
%mem2, %token2 = gpu.alloc async [%token1] (%arg0) : memref<?xf64>		%mem2, %token2 = gpu.alloc async [%token1] (%arg0) : memref<?xf64>
%env, %token3 = gpu.create_sparse_env async [%token2]		%token3 = gpu.create_sparse_env async [%token2] CUSPARSE
		aartbik: It does not make sense to get a token on this anymore, as Peiming said above. More importantly, I think we need a module-level setup.
%spmat, %token4 = gpu.create_coo async [%token3] %arg0, %arg0, %arg0, %mem1, %mem1, %mem2 : memref<?xindex>, memref<?xindex>, memref<?xf64>		%spmat, %token4 = gpu.create_coo async [%token3] %arg0, %arg0, %arg0, %mem1, %mem1, %mem2 : memref<?xindex>, memref<?xindex>, memref<?xf64>
%dnvec, %token5 = gpu.create_dn_tensor async [%token4] %env, %mem2, %arg0 : index into memref<?xf64>		%dnvec, %token5 = gpu.create_dn_tensor async [%token4] %mem2, %arg0 : index into memref<?xf64>
%bufferSz, %token6 = gpu.spmv_buffer_size async [%token5] %env, %spmat, %dnvec, %dnvec into f64		%bufferSz, %token6 = gpu.spmv_buffer_size async [%token5] %spmat, %dnvec, %dnvec into f64
%token7 = gpu.spmv async [%token6] %env, %spmat, %dnvec, %dnvec, %mem2 : memref<?xf64> into f64		%token7 = gpu.spmv async [%token6] %spmat, %dnvec, %dnvec, %mem2 : memref<?xf64> into f64
%token8 = gpu.destroy_sp_mat async [%token7] %spmat		%token8 = gpu.destroy_sp_mat async [%token7] %spmat
%token9 = gpu.destroy_dn_tensor async [%token8] %dnvec		%token9 = gpu.destroy_dn_tensor async [%token8] %dnvec
%token10 = gpu.destroy_sparse_env async [%token9] %env		%token10 = gpu.destroy_sparse_env async [%token9] CUSPARSE
gpu.wait [%token10]		gpu.wait [%token10]
		aartbik: Same here: since we do not run this code, just leave the create/destroy calls out of the IR.
return		return
}		}

// CHECK-LABEL: func @matmul		// CHECK-LABEL: func @matmul
// CHECK: llvm.call @mgpuStreamCreate		// CHECK: llvm.call @mgpuStreamCreate
// CHECK: llvm.call @mgpuMemAlloc		// CHECK: llvm.call @mgpuMemAlloc
// CHECK: llvm.call @mgpuMemAlloc		// CHECK: llvm.call @mgpuMemAlloc
// CHECK: llvm.call @mgpuCreateSparseEnv		// CHECK: llvm.call @mgpuCreateSparseEnv
// CHECK: llvm.call @mgpuCreateCsr		// CHECK: llvm.call @mgpuCreateCsr
// CHECK: llvm.call @mgpuCreateDnMat		// CHECK: llvm.call @mgpuCreateDnMat
// CHECK: llvm.call @mgpuSpMMBufferSize		// CHECK: llvm.call @mgpuSpMMBufferSize
// CHECK: llvm.call @mgpuSpMM		// CHECK: llvm.call @mgpuSpMM
// CHECK: llvm.call @mgpuDestroySpMat		// CHECK: llvm.call @mgpuDestroySpMat
// CHECK: llvm.call @mgpuDestroyDnMat		// CHECK: llvm.call @mgpuDestroyDnMat
// CHECK: llvm.call @mgpuDestroySparseEnv		// CHECK: llvm.call @mgpuDestroySparseEnv
// CHECK: llvm.call @mgpuStreamSynchronize		// CHECK: llvm.call @mgpuStreamSynchronize
// CHECK: llvm.call @mgpuStreamDestroy		// CHECK: llvm.call @mgpuStreamDestroy
func.func @matmul(%arg0: index) {		func.func @matmul(%arg0: index) {
%token0 = gpu.wait async		%token0 = gpu.wait async
%mem1, %token1 = gpu.alloc async [%token0] (%arg0) : memref<?xindex>		%mem1, %token1 = gpu.alloc async [%token0] (%arg0) : memref<?xindex>
%mem2, %token2 = gpu.alloc async [%token1] (%arg0) : memref<?xf64>		%mem2, %token2 = gpu.alloc async [%token1] (%arg0) : memref<?xf64>
%env, %token3 = gpu.create_sparse_env async [%token2]		%token3 = gpu.create_sparse_env async [%token2] CUSPARSE
%spmat, %token4 = gpu.create_csr async [%token3] %arg0, %arg0, %arg0, %mem1, %mem1, %mem2 : memref<?xindex>, memref<?xindex>, memref<?xf64>		%spmat, %token4 = gpu.create_csr async [%token3] %arg0, %arg0, %arg0, %mem1, %mem1, %mem2 : memref<?xindex>, memref<?xindex>, memref<?xf64>
%dnmat, %token5 = gpu.create_dn_tensor async [%token4] %env, %mem2, %arg0, %arg0 : index, index into memref<?xf64>		%dnmat, %token5 = gpu.create_dn_tensor async [%token4] %mem2, %arg0, %arg0 : index, index into memref<?xf64>
%bufferSz, %token6 = gpu.spmm_buffer_size async [%token5] %env, %spmat, %dnmat, %dnmat : index into f64		%bufferSz, %token6 = gpu.spmm_buffer_size async [%token5] %spmat, %dnmat, %dnmat : index into f64
%token7 = gpu.spmm async [%token6] %env, %spmat, %dnmat, %dnmat, %mem2 : memref<?xf64> into f64		%token7 = gpu.spmm async [%token6] %spmat, %dnmat, %dnmat, %mem2 : memref<?xf64> into f64
%token8 = gpu.destroy_sp_mat async [%token7] %spmat		%token8 = gpu.destroy_sp_mat async [%token7] %spmat
%token9 = gpu.destroy_dn_tensor async [%token8] %dnmat		%token9 = gpu.destroy_dn_tensor async [%token8] %dnmat
%token10 = gpu.destroy_sparse_env async [%token9] %env		%token10 = gpu.destroy_sparse_env async [%token9] CUSPARSE
gpu.wait [%token10]		gpu.wait [%token10]
return		return
}		}

// CHECK-LABEL: func @sddmm		// CHECK-LABEL: func @sddmm
// CHECK: llvm.call @mgpuStreamCreate		// CHECK: llvm.call @mgpuStreamCreate
// CHECK: llvm.call @mgpuMemAlloc		// CHECK: llvm.call @mgpuMemAlloc
// CHECK: llvm.call @mgpuMemAlloc		// CHECK: llvm.call @mgpuMemAlloc
// CHECK: llvm.call @mgpuCreateSparseEnv		// CHECK: llvm.call @mgpuCreateSparseEnv
// CHECK: llvm.call @mgpuCreateCsr		// CHECK: llvm.call @mgpuCreateCsr
// CHECK: llvm.call @mgpuCreateDnMat		// CHECK: llvm.call @mgpuCreateDnMat
// CHECK: llvm.call @mgpuSDDMMBufferSize		// CHECK: llvm.call @mgpuSDDMMBufferSize
// CHECK: llvm.call @mgpuSDDMM		// CHECK: llvm.call @mgpuSDDMM
// CHECK: llvm.call @mgpuDestroySpMat		// CHECK: llvm.call @mgpuDestroySpMat
// CHECK: llvm.call @mgpuDestroyDnMat		// CHECK: llvm.call @mgpuDestroyDnMat
// CHECK: llvm.call @mgpuDestroySparseEnv		// CHECK: llvm.call @mgpuDestroySparseEnv
// CHECK: llvm.call @mgpuStreamSynchronize		// CHECK: llvm.call @mgpuStreamSynchronize
// CHECK: llvm.call @mgpuStreamDestroy		// CHECK: llvm.call @mgpuStreamDestroy
func.func @sddmm(%arg0: index) {		func.func @sddmm(%arg0: index) {
%token0 = gpu.wait async		%token0 = gpu.wait async
%mem1, %token1 = gpu.alloc async [%token0] (%arg0) : memref<?xindex>		%mem1, %token1 = gpu.alloc async [%token0] (%arg0) : memref<?xindex>
%mem2, %token2 = gpu.alloc async [%token1] (%arg0) : memref<?xf64>		%mem2, %token2 = gpu.alloc async [%token1] (%arg0) : memref<?xf64>
%env, %token3 = gpu.create_sparse_env async [%token2]		%token3 = gpu.create_sparse_env async [%token2] CUSPARSE
%spmat, %token4 = gpu.create_csr async [%token3] %arg0, %arg0, %arg0, %mem1, %mem1, %mem2 : memref<?xindex>, memref<?xindex>, memref<?xf64>		%spmat, %token4 = gpu.create_csr async [%token3] %arg0, %arg0, %arg0, %mem1, %mem1, %mem2 : memref<?xindex>, memref<?xindex>, memref<?xf64>
%dnmat, %token5 = gpu.create_dn_tensor async [%token4] %env, %mem2, %arg0, %arg0 : index, index into memref<?xf64>		%dnmat, %token5 = gpu.create_dn_tensor async [%token4] %mem2, %arg0, %arg0 : index, index into memref<?xf64>
%bufferSz, %token6 = gpu.sddmm_buffer_size async [%token5] %env, %dnmat, %dnmat, %spmat into f64		%bufferSz, %token6 = gpu.sddmm_buffer_size async [%token5] %dnmat, %dnmat, %spmat into f64
%token7 = gpu.sddmm async [%token6] %env, %dnmat, %dnmat, %spmat, %mem2 : memref<?xf64> into f64		%token7 = gpu.sddmm async [%token6] %dnmat, %dnmat, %spmat, %mem2 : memref<?xf64> into f64
%token8 = gpu.destroy_sp_mat async [%token7] %spmat		%token8 = gpu.destroy_sp_mat async [%token7] %spmat
%token9 = gpu.destroy_dn_tensor async [%token8] %dnmat		%token9 = gpu.destroy_dn_tensor async [%token8] %dnmat
%token10 = gpu.destroy_sparse_env async [%token9] %env		%token10 = gpu.destroy_sparse_env async [%token9] CUSPARSE
gpu.wait [%token10]		gpu.wait [%token10]
return		return
}		}

}		}
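
The inline comments above ask for a module-level setup instead of per-function create/destroy ops. A minimal sketch of that idea follows, assuming a reference-counted, process-wide handle. `SparseHandle`, `acquireSparseEnv`, `releaseSparseEnv`, and `sharedHandleDemo` are hypothetical names used only for illustration; they are not the functions in the CUDA runtime wrappers, and no CUDA library is called here.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical stand-in for a library handle such as cusparseHandle_t.
struct SparseHandle {
  int generation; // which creation this handle came from
};

namespace {
std::mutex envMutex;
SparseHandle *envHandle = nullptr; // shared by all callers
int envUsers = 0;                  // number of outstanding acquisitions
int envCreations = 0;              // how many times a handle was created
} // namespace

// Creates the handle on first use, otherwise returns the existing one.
SparseHandle *acquireSparseEnv() {
  std::lock_guard<std::mutex> lock(envMutex);
  if (envUsers++ == 0)
    envHandle = new SparseHandle{++envCreations};
  return envHandle;
}

// Frees the handle once the last user has released it.
void releaseSparseEnv() {
  std::lock_guard<std::mutex> lock(envMutex);
  assert(envUsers > 0 && "release without a matching acquire");
  if (--envUsers == 0) {
    delete envHandle;
    envHandle = nullptr;
  }
}

// Two nested users observe the same handle, and it is created only once.
bool sharedHandleDemo() {
  int before = envCreations;
  SparseHandle *a = acquireSparseEnv();
  SparseHandle *b = acquireSparseEnv();
  bool same = (a == b) && (envCreations == before + 1);
  releaseSparseEnv();
  releaseSparseEnv();
  return same && envHandle == nullptr;
}
```

With a scheme like this, the individual sparse ops would not need an environment operand, which matches the right-hand side of the diffs above where `%env` is dropped from each op.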

mlir/test/Dialect/GPU/ops.mlir

Show First 20 Lines • Show All 321 Lines • ▼ Show 20 Lines	module attributes {gpu.container_module} {
func.func @sparse_ops(%arg0: index) {		func.func @sparse_ops(%arg0: index) {
// CHECK: gpu.wait async		// CHECK: gpu.wait async
%token0 = gpu.wait async		%token0 = gpu.wait async
// CHECK: gpu.alloc async		// CHECK: gpu.alloc async
%mem1, %token1 = gpu.alloc async [%token0] (%arg0) : memref<?xindex>		%mem1, %token1 = gpu.alloc async [%token0] (%arg0) : memref<?xindex>
// CHECK: gpu.alloc async		// CHECK: gpu.alloc async
%mem2, %token2 = gpu.alloc async [%token1] (%arg0) : memref<?xf64>		%mem2, %token2 = gpu.alloc async [%token1] (%arg0) : memref<?xf64>
// CHECK: gpu.create_sparse_env async		// CHECK: gpu.create_sparse_env async
%env, %token3 = gpu.create_sparse_env async [%token2]		%token3 = gpu.create_sparse_env async [%token2] CUSPARSE
		aartbik: omit
// CHECK: gpu.create_coo async		// CHECK: gpu.create_coo async
%spmat, %token4 = gpu.create_coo async [%token3] %arg0, %arg0, %arg0, %mem1, %mem1, %mem2 : memref<?xindex>, memref<?xindex>, memref<?xf64>		%spmat, %token4 = gpu.create_coo async [%token3] %arg0, %arg0, %arg0, %mem1, %mem1, %mem2 : memref<?xindex>, memref<?xindex>, memref<?xf64>
// CHECK: gpu.create_csr async		// CHECK: gpu.create_csr async
%spmat2, %token5 = gpu.create_csr async [%token4] %arg0, %arg0, %arg0, %mem1, %mem1, %mem2 : memref<?xindex>, memref<?xindex>, memref<?xf64>		%spmat2, %token5 = gpu.create_csr async [%token4] %arg0, %arg0, %arg0, %mem1, %mem1, %mem2 : memref<?xindex>, memref<?xindex>, memref<?xf64>
// CHECK: gpu.create_dn_tensor async		// CHECK: gpu.create_dn_tensor async
%dnvec, %token6 = gpu.create_dn_tensor async [%token5] %env, %mem2, %arg0 : index into memref<?xf64>		%dnvec, %token6 = gpu.create_dn_tensor async [%token5] %mem2, %arg0 : index into memref<?xf64>
// CHECK: gpu.spmv_buffer_size async		// CHECK: gpu.spmv_buffer_size async
%bufferSz, %token7 = gpu.spmv_buffer_size async [%token6] %env, %spmat, %dnvec, %dnvec into f64		%bufferSz, %token7 = gpu.spmv_buffer_size async [%token6] %spmat, %dnvec, %dnvec into f64
// CHECK: gpu.spmv async		// CHECK: gpu.spmv async
%token8 = gpu.spmv async [%token7] %env, %spmat, %dnvec, %dnvec, %mem2 : memref<?xf64> into f64		%token8 = gpu.spmv async [%token7] %spmat, %dnvec, %dnvec, %mem2 : memref<?xf64> into f64
// CHECK: gpu.create_dn_tensor async		// CHECK: gpu.create_dn_tensor async
%dnmat, %token9 = gpu.create_dn_tensor async [%token8] %env, %mem2, %arg0, %arg0 : index, index into memref<?xf64>		%dnmat, %token9 = gpu.create_dn_tensor async [%token8] %mem2, %arg0, %arg0 : index, index into memref<?xf64>
// CHECK: gpu.spmm_buffer_size async		// CHECK: gpu.spmm_buffer_size async
%bufferSz2, %token10 = gpu.spmm_buffer_size async [%token9] %env, %spmat, %dnmat, %dnmat : index into f64		%bufferSz2, %token10 = gpu.spmm_buffer_size async [%token9] %spmat, %dnmat, %dnmat : index into f64
// CHECK: gpu.spmm async		// CHECK: gpu.spmm async
%token11 = gpu.spmm async [%token10] %env, %spmat, %dnmat, %dnmat, %mem2 : memref<?xf64> into f64		%token11 = gpu.spmm async [%token10] %spmat, %dnmat, %dnmat, %mem2 : memref<?xf64> into f64
// CHECK: gpu.sddmm_buffer_size async		// CHECK: gpu.sddmm_buffer_size async
%bufferSz3, %token12 = gpu.sddmm_buffer_size async [%token11] %env, %dnmat, %dnmat, %spmat into f64		%bufferSz3, %token12 = gpu.sddmm_buffer_size async [%token11] %dnmat, %dnmat, %spmat into f64
// CHECK: gpu.sddmm async		// CHECK: gpu.sddmm async
%token13 = gpu.sddmm async [%token12] %env, %dnmat, %dnmat, %spmat, %mem2 : memref<?xf64> into f64		%token13 = gpu.sddmm async [%token12] %dnmat, %dnmat, %spmat, %mem2 : memref<?xf64> into f64
// CHECK: gpu.destroy_dn_tensor async		// CHECK: gpu.destroy_dn_tensor async
%token14 = gpu.destroy_dn_tensor async [%token13] %dnmat		%token14 = gpu.destroy_dn_tensor async [%token13] %dnmat
// CHECK: gpu.destroy_sp_mat async		// CHECK: gpu.destroy_sp_mat async
%token15 = gpu.destroy_sp_mat async [%token14] %spmat		%token15 = gpu.destroy_sp_mat async [%token14] %spmat
// CHECK: gpu.destroy_dn_tensor async		// CHECK: gpu.destroy_dn_tensor async
%token16 = gpu.destroy_dn_tensor async [%token15] %dnvec		%token16 = gpu.destroy_dn_tensor async [%token15] %dnvec
// CHECK: gpu.destroy_sparse_env async		// CHECK: gpu.destroy_sparse_env async
%token17 = gpu.destroy_sparse_env async [%token16] %env		%token17 = gpu.destroy_sparse_env async [%token16] CUSPARSE
		aartbik: omit
// CHECK: gpu.wait		// CHECK: gpu.wait
gpu.wait [%token17]		gpu.wait [%token17]
return		return
}		}
}		}

// Just check that this doesn't crash.		// Just check that this doesn't crash.
gpu.module @module {		gpu.module @module {
"gpu.func"() ({		"gpu.func"() ({
gpu.return		gpu.return
}) {function_type = () -> (), sym_name = "func"} : () -> ()		}) {function_type = () -> (), sym_name = "func"} : () -> ()
}		}

mlir/test/Dialect/GPU/sparse-roundtrip.mlir

	// RUN: mlir-opt %s -split-input-file | mlir-opt -split-input-file | FileCheck %s			// RUN: mlir-opt %s -split-input-file | mlir-opt -split-input-file | FileCheck %s

	module attributes {gpu.container_module} {			module attributes {gpu.container_module} {

	// CHECK-LABEL: func @matvec			// CHECK-LABEL: func @matvec
	// CHECK: %{{.*}} = gpu.wait async			// CHECK: %{{.*}} = gpu.wait async
	// CHECK: %{{.*}}, %{{.*}} = gpu.alloc async [%{{.*}}] (%{{.*}}) : memref<?xindex>			// CHECK: %{{.*}}, %{{.*}} = gpu.alloc async [%{{.*}}] (%{{.*}}) : memref<?xindex>
	// CHECK: %{{.*}}, %{{.*}} = gpu.alloc async [%{{.*}}] (%{{.*}}) : memref<?xf64>			// CHECK: %{{.*}}, %{{.*}} = gpu.alloc async [%{{.*}}] (%{{.*}}) : memref<?xf64>
	// CHECK: %{{.*}}, %{{.*}} = gpu.create_sparse_env async [%{{.*}}]			// CHECK: %{{.*}} = gpu.create_sparse_env async [%{{.*}}] CUSPARSE
	// CHECK: %{{.*}}, %{{.*}} = gpu.create_coo async [%{{.*}}] %{{.*}}, %{{.*}}, %{{.*}}, %{{.*}}, %{{.*}}, %{{.*}} : memref<?xindex>, memref<?xindex>, memref<?xf64>			// CHECK: %{{.*}}, %{{.*}} = gpu.create_coo async [%{{.*}}] %{{.*}}, %{{.*}}, %{{.*}}, %{{.*}}, %{{.*}}, %{{.*}} : memref<?xindex>, memref<?xindex>, memref<?xf64>
	// CHECK: %{{.*}}, %{{.*}} = gpu.create_dn_tensor async [%{{.*}}] %{{.*}}, %{{.*}}, %{{.*}} : index into memref<?xf64>			// CHECK: %{{.*}}, %{{.*}} = gpu.create_dn_tensor async [%{{.*}}] %{{.*}}, %{{.*}} : index into memref<?xf64>
	// CHECK: %{{.*}}, %{{.*}} = gpu.spmv_buffer_size async [%{{.*}}] %{{.*}}, %{{.*}}, %{{.*}}, %{{.*}} into f64			// CHECK: %{{.*}}, %{{.*}} = gpu.spmv_buffer_size async [%{{.*}}] %{{.*}}, %{{.*}}, %{{.*}} into f64
	// CHECK: %{{.*}} = gpu.spmv async [%{{.*}}] %{{.*}}, %{{.*}}, %{{.*}}, %{{.*}}, %{{.*}} : memref<?xf64> into f64			// CHECK: %{{.*}} = gpu.spmv async [%{{.*}}] %{{.*}}, %{{.*}}, %{{.*}}, %{{.*}} : memref<?xf64> into f64
	// CHECK: %{{.*}} = gpu.destroy_sp_mat async [%{{.*}}] %{{.*}}			// CHECK: %{{.*}} = gpu.destroy_sp_mat async [%{{.*}}] %{{.*}}
	// CHECK: %{{.*}} = gpu.destroy_dn_tensor async [%{{.*}}] %{{.*}}			// CHECK: %{{.*}} = gpu.destroy_dn_tensor async [%{{.*}}] %{{.*}}
	// CHECK: %{{.*}} = gpu.destroy_sparse_env async [%{{.*}}] %{{.*}}			// CHECK: %{{.*}} = gpu.destroy_sparse_env async [%{{.*}}] CUSPARSE
	// CHECK: gpu.wait [%{{.*}}]			// CHECK: gpu.wait [%{{.*}}]
	// CHECK: return			// CHECK: return
	func.func @matvec(%arg0: index) {			func.func @matvec(%arg0: index) {
	%token0 = gpu.wait async			%token0 = gpu.wait async
	%mem1, %token1 = gpu.alloc async [%token0] (%arg0) : memref<?xindex>			%mem1, %token1 = gpu.alloc async [%token0] (%arg0) : memref<?xindex>
	%mem2, %token2 = gpu.alloc async [%token1] (%arg0) : memref<?xf64>			%mem2, %token2 = gpu.alloc async [%token1] (%arg0) : memref<?xf64>
	%env, %token3 = gpu.create_sparse_env async [%token2]			%token3 = gpu.create_sparse_env async [%token2] CUSPARSE
	%spmat, %token4 = gpu.create_coo async [%token3] %arg0, %arg0, %arg0, %mem1, %mem1, %mem2 : memref<?xindex>, memref<?xindex>, memref<?xf64>			%spmat, %token4 = gpu.create_coo async [%token3] %arg0, %arg0, %arg0, %mem1, %mem1, %mem2 : memref<?xindex>, memref<?xindex>, memref<?xf64>
	%dnvec, %token5 = gpu.create_dn_tensor async [%token4] %env, %mem2, %arg0 : index into memref<?xf64>			%dnvec, %token5 = gpu.create_dn_tensor async [%token4] %mem2, %arg0 : index into memref<?xf64>
	%bufferSz, %token6 = gpu.spmv_buffer_size async [%token5] %env, %spmat, %dnvec, %dnvec into f64			%bufferSz, %token6 = gpu.spmv_buffer_size async [%token5] %spmat, %dnvec, %dnvec into f64
	%token7 = gpu.spmv async [%token6] %env, %spmat, %dnvec, %dnvec, %mem2 : memref<?xf64> into f64			%token7 = gpu.spmv async [%token6] %spmat, %dnvec, %dnvec, %mem2 : memref<?xf64> into f64
	%token8 = gpu.destroy_sp_mat async [%token7] %spmat			%token8 = gpu.destroy_sp_mat async [%token7] %spmat
	%token9 = gpu.destroy_dn_tensor async [%token8] %dnvec			%token9 = gpu.destroy_dn_tensor async [%token8] %dnvec
	%token10 = gpu.destroy_sparse_env async [%token9] %env			%token10 = gpu.destroy_sparse_env async [%token9] CUSPARSE
	gpu.wait [%token10]			gpu.wait [%token10]
	return			return
	}			}

	// CHECK-LABEL: func @matmul			// CHECK-LABEL: func @matmul
	// CHECK: %{{.*}} = gpu.wait async			// CHECK: %{{.*}} = gpu.wait async
	// CHECK: %{{.*}}, %{{.*}} = gpu.alloc async [%{{.*}}] (%{{.*}}) : memref<?xindex>			// CHECK: %{{.*}}, %{{.*}} = gpu.alloc async [%{{.*}}] (%{{.*}}) : memref<?xindex>
	// CHECK: %{{.*}}, %{{.*}} = gpu.alloc async [%{{.*}}] (%{{.*}}) : memref<?xf64>			// CHECK: %{{.*}}, %{{.*}} = gpu.alloc async [%{{.*}}] (%{{.*}}) : memref<?xf64>
	// CHECK: %{{.*}}, %{{.*}} = gpu.create_sparse_env async [%{{.*}}]			// CHECK: %{{.*}} = gpu.create_sparse_env async [%{{.*}}] CUSPARSE
	// CHECK: %{{.*}}, %{{.*}} = gpu.create_csr async [%{{.*}}] %{{.*}}, %{{.*}}, %{{.*}}, %{{.*}}, %{{.*}}, %{{.*}} : memref<?xindex>, memref<?xindex>, memref<?xf64>			// CHECK: %{{.*}}, %{{.*}} = gpu.create_csr async [%{{.*}}] %{{.*}}, %{{.*}}, %{{.*}}, %{{.*}}, %{{.*}}, %{{.*}} : memref<?xindex>, memref<?xindex>, memref<?xf64>
	// CHECK: %{{.*}}, %{{.*}} = gpu.create_dn_tensor async [%{{.*}}] %{{.*}}, %{{.*}}, %{{.*}}, %{{.*}} : index, index into memref<?xf64>			// CHECK: %{{.*}}, %{{.*}} = gpu.create_dn_tensor async [%{{.*}}] %{{.*}}, %{{.*}}, %{{.*}} : index, index into memref<?xf64>
	// CHECK: %{{.*}}, %{{.*}} = gpu.spmm_buffer_size async [%{{.*}}] %{{.*}}, %{{.*}}, %{{.*}}, %{{.*}} into f64			// CHECK: %{{.*}}, %{{.*}} = gpu.spmm_buffer_size async [%{{.*}}] %{{.*}}, %{{.*}}, %{{.*}} into f64
	// CHECK: %{{.*}} = gpu.spmm async [%{{.*}}] %{{.*}}, %{{.*}}, %{{.*}}, %{{.*}}, %{{.*}} : memref<?xf64> into f64			// CHECK: %{{.*}} = gpu.spmm async [%{{.*}}] %{{.*}}, %{{.*}}, %{{.*}}, %{{.*}} : memref<?xf64> into f64
	// CHECK: %{{.*}} = gpu.destroy_sp_mat async [%{{.*}}] %{{.*}}			// CHECK: %{{.*}} = gpu.destroy_sp_mat async [%{{.*}}] %{{.*}}
	// CHECK: %{{.*}} = gpu.destroy_dn_tensor async [%{{.*}}] %{{.*}}			// CHECK: %{{.*}} = gpu.destroy_dn_tensor async [%{{.*}}] %{{.*}}
	// CHECK: %{{.*}} = gpu.destroy_sparse_env async [%{{.*}}] %{{.*}}			// CHECK: %{{.*}} = gpu.destroy_sparse_env async [%{{.*}}] CUSPARSE
	// CHECK: gpu.wait [%{{.*}}]			// CHECK: gpu.wait [%{{.*}}]
	// CHECK: return			// CHECK: return
	func.func @matmul(%arg0: index) {			func.func @matmul(%arg0: index) {
	%token0 = gpu.wait async			%token0 = gpu.wait async
	%mem1, %token1 = gpu.alloc async [%token0] (%arg0) : memref<?xindex>			%mem1, %token1 = gpu.alloc async [%token0] (%arg0) : memref<?xindex>
	%mem2, %token2 = gpu.alloc async [%token1] (%arg0) : memref<?xf64>			%mem2, %token2 = gpu.alloc async [%token1] (%arg0) : memref<?xf64>
	%env, %token3 = gpu.create_sparse_env async [%token2]			%token3 = gpu.create_sparse_env async [%token2] CUSPARSE
	%spmat, %token4 = gpu.create_csr async [%token3] %arg0, %arg0, %arg0, %mem1, %mem1, %mem2 : memref<?xindex>, memref<?xindex>, memref<?xf64>			%spmat, %token4 = gpu.create_csr async [%token3] %arg0, %arg0, %arg0, %mem1, %mem1, %mem2 : memref<?xindex>, memref<?xindex>, memref<?xf64>
	%dnmat, %token5 = gpu.create_dn_tensor async [%token4] %env, %mem2, %arg0, %arg0 : index, index into memref<?xf64>			%dnmat, %token5 = gpu.create_dn_tensor async [%token4] %mem2, %arg0, %arg0 : index, index into memref<?xf64>
	%bufferSz, %token6 = gpu.spmm_buffer_size async [%token5] %env, %spmat, %dnmat, %dnmat : index into f64			%bufferSz, %token6 = gpu.spmm_buffer_size async [%token5] %spmat, %dnmat, %dnmat : index into f64
	%token7 = gpu.spmm async [%token6] %env, %spmat, %dnmat, %dnmat, %mem2 : memref<?xf64> into f64			%token7 = gpu.spmm async [%token6] %spmat, %dnmat, %dnmat, %mem2 : memref<?xf64> into f64
	%token8 = gpu.destroy_sp_mat async [%token7] %spmat			%token8 = gpu.destroy_sp_mat async [%token7] %spmat
	%token9 = gpu.destroy_dn_tensor async [%token8] %dnmat			%token9 = gpu.destroy_dn_tensor async [%token8] %dnmat
	%token10 = gpu.destroy_sparse_env async [%token9] %env			%token10 = gpu.destroy_sparse_env async [%token9] CUSPARSE
	gpu.wait [%token10]			gpu.wait [%token10]
	return			return
	}			}

	// CHECK-LABEL: func @sddmm			// CHECK-LABEL: func @sddmm
	// CHECK: %{{.*}} = gpu.wait async			// CHECK: %{{.*}} = gpu.wait async
	// CHECK: %{{.*}}, %{{.*}} = gpu.alloc async [%{{.*}}] (%{{.*}}) : memref<?xindex>			// CHECK: %{{.*}}, %{{.*}} = gpu.alloc async [%{{.*}}] (%{{.*}}) : memref<?xindex>
	// CHECK: %{{.*}}, %{{.*}} = gpu.alloc async [%{{.*}}] (%{{.*}}) : memref<?xf64>			// CHECK: %{{.*}}, %{{.*}} = gpu.alloc async [%{{.*}}] (%{{.*}}) : memref<?xf64>
	// CHECK: %{{.*}}, %{{.*}} = gpu.create_sparse_env async [%{{.*}}]			// CHECK: %{{.*}} = gpu.create_sparse_env async [%{{.*}}] CUSPARSE
	// CHECK: %{{.*}}, %{{.*}} = gpu.create_csr async [%{{.*}}] %{{.*}}, %{{.*}}, %{{.*}}, %{{.*}}, %{{.*}}, %{{.*}} : memref<?xindex>, memref<?xindex>, memref<?xf64>			// CHECK: %{{.*}}, %{{.*}} = gpu.create_csr async [%{{.*}}] %{{.*}}, %{{.*}}, %{{.*}}, %{{.*}}, %{{.*}}, %{{.*}} : memref<?xindex>, memref<?xindex>, memref<?xf64>
	// CHECK: %{{.*}}, %{{.*}} = gpu.create_dn_tensor async [%{{.*}}] %{{.*}}, %{{.*}}, %{{.*}}, %{{.*}} : index, index into memref<?xf64>			// CHECK: %{{.*}}, %{{.*}} = gpu.create_dn_tensor async [%{{.*}}] %{{.*}}, %{{.*}}, %{{.*}} : index, index into memref<?xf64>
	// CHECK: %{{.*}}, %{{.*}} = gpu.sddmm_buffer_size async [%{{.*}}] %{{.*}}, %{{.*}}, %{{.*}}, %{{.*}} into f64			// CHECK: %{{.*}}, %{{.*}} = gpu.sddmm_buffer_size async [%{{.*}}] %{{.*}}, %{{.*}}, %{{.*}} into f64
	// CHECK: %{{.*}} = gpu.sddmm async [%{{.*}}] %{{.*}}, %{{.*}}, %{{.*}}, %{{.*}}, %{{.*}} : memref<?xf64> into f64			// CHECK: %{{.*}} = gpu.sddmm async [%{{.*}}] %{{.*}}, %{{.*}}, %{{.*}}, %{{.*}} : memref<?xf64> into f64
	// CHECK: %{{.*}} = gpu.destroy_sp_mat async [%{{.*}}] %{{.*}}			// CHECK: %{{.*}} = gpu.destroy_sp_mat async [%{{.*}}] %{{.*}}
	// CHECK: %{{.*}} = gpu.destroy_dn_tensor async [%{{.*}}] %{{.*}}			// CHECK: %{{.*}} = gpu.destroy_dn_tensor async [%{{.*}}] %{{.*}}
	// CHECK: %{{.*}} = gpu.destroy_sparse_env async [%{{.*}}] %{{.*}}			// CHECK: %{{.*}} = gpu.destroy_sparse_env async [%{{.*}}] CUSPARSE
	// CHECK: gpu.wait [%{{.*}}]			// CHECK: gpu.wait [%{{.*}}]
	// CHECK: return			// CHECK: return
	func.func @sddmm(%arg0: index) {			func.func @sddmm(%arg0: index) {
	%token0 = gpu.wait async			%token0 = gpu.wait async
	%mem1, %token1 = gpu.alloc async [%token0] (%arg0) : memref<?xindex>			%mem1, %token1 = gpu.alloc async [%token0] (%arg0) : memref<?xindex>
	%mem2, %token2 = gpu.alloc async [%token1] (%arg0) : memref<?xf64>			%mem2, %token2 = gpu.alloc async [%token1] (%arg0) : memref<?xf64>
	%env, %token3 = gpu.create_sparse_env async [%token2]			%token3 = gpu.create_sparse_env async [%token2] CUSPARSE
	%spmat, %token4 = gpu.create_csr async [%token3] %arg0, %arg0, %arg0, %mem1, %mem1, %mem2 : memref<?xindex>, memref<?xindex>, memref<?xf64>			%spmat, %token4 = gpu.create_csr async [%token3] %arg0, %arg0, %arg0, %mem1, %mem1, %mem2 : memref<?xindex>, memref<?xindex>, memref<?xf64>
	%dnmat, %token5 = gpu.create_dn_tensor async [%token4] %env, %mem2, %arg0, %arg0 : index, index into memref<?xf64>			%dnmat, %token5 = gpu.create_dn_tensor async [%token4] %mem2, %arg0, %arg0 : index, index into memref<?xf64>
	%bufferSz, %token6 = gpu.sddmm_buffer_size async [%token5] %env, %dnmat, %dnmat, %spmat into f64			%bufferSz, %token6 = gpu.sddmm_buffer_size async [%token5] %dnmat, %dnmat, %spmat into f64
	%token7 = gpu.sddmm async [%token6] %env, %dnmat, %dnmat, %spmat, %mem2 : memref<?xf64> into f64			%token7 = gpu.sddmm async [%token6] %dnmat, %dnmat, %spmat, %mem2 : memref<?xf64> into f64
	%token8 = gpu.destroy_sp_mat async [%token7] %spmat			%token8 = gpu.destroy_sp_mat async [%token7] %spmat
	%token9 = gpu.destroy_dn_tensor async [%token8] %dnmat			%token9 = gpu.destroy_dn_tensor async [%token8] %dnmat
	%token10 = gpu.destroy_sparse_env async [%token9] %env			%token10 = gpu.destroy_sparse_env async [%token9] CUSPARSE
	gpu.wait [%token10]			gpu.wait [%token10]
	return			return
	}			}

	}			}

mlir/test/Dialect/SparseTensor/GPU/gpu_matmul_lib.mlir

	Show All 39 Lines
	// CHECK: %[[VAL_34:.*]] = bufferization.to_memref %[[VAL_2]] : memref<?x?xf64>			// CHECK: %[[VAL_34:.*]] = bufferization.to_memref %[[VAL_2]] : memref<?x?xf64>
	// CHECK: %[[VAL_35:.*]] = gpu.wait async			// CHECK: %[[VAL_35:.*]] = gpu.wait async
	// CHECK: %[[VAL_36:.*]] = memref.dim %[[VAL_34]], %[[VAL_3]] : memref<?x?xf64>			// CHECK: %[[VAL_36:.*]] = memref.dim %[[VAL_34]], %[[VAL_3]] : memref<?x?xf64>
	// CHECK: %[[VAL_37:.*]] = memref.dim %[[VAL_34]], %[[VAL_4]] : memref<?x?xf64>			// CHECK: %[[VAL_37:.*]] = memref.dim %[[VAL_34]], %[[VAL_4]] : memref<?x?xf64>
	// CHECK: %[[VAL_38:.*]], %[[VAL_39:.*]] = gpu.alloc async {{\[}}%[[VAL_35]]] (%[[VAL_36]], %[[VAL_37]]) : memref<?x?xf64>			// CHECK: %[[VAL_38:.*]], %[[VAL_39:.*]] = gpu.alloc async {{\[}}%[[VAL_35]]] (%[[VAL_36]], %[[VAL_37]]) : memref<?x?xf64>
	// CHECK: %[[VAL_40:.*]] = gpu.memcpy async {{\[}}%[[VAL_39]]] %[[VAL_38]], %[[VAL_34]] : memref<?x?xf64>, memref<?x?xf64>			// CHECK: %[[VAL_40:.*]] = gpu.memcpy async {{\[}}%[[VAL_39]]] %[[VAL_38]], %[[VAL_34]] : memref<?x?xf64>, memref<?x?xf64>
	// CHECK: gpu.wait {{\[}}%[[VAL_16]], %[[VAL_21]], %[[VAL_26]], %[[VAL_33]], %[[VAL_40]]]			// CHECK: gpu.wait {{\[}}%[[VAL_16]], %[[VAL_21]], %[[VAL_26]], %[[VAL_33]], %[[VAL_40]]]
	// CHECK: %[[VAL_41:.*]] = gpu.wait async			// CHECK: %[[VAL_41:.*]] = gpu.wait async
	// CHECK: %[[VAL_42:.*]], %[[VAL_43:.*]] = gpu.create_sparse_env async {{\[}}%[[VAL_41]]]			// CHECK: %[[VAL_43:.*]] = gpu.create_sparse_env async {{\[}}%[[VAL_41]]] CUSPARSE
	// CHECK: %[[VAL_44:.*]], %[[VAL_45:.*]] = gpu.create_csr async {{\[}}%[[VAL_43]]] %[[VAL_6]], %[[VAL_7]], %[[VAL_5]], %[[VAL_14]], %[[VAL_19]], %[[VAL_24]] : memref<?xindex>, memref<?xindex>, memref<?xf64>			// CHECK: %[[VAL_44:.*]], %[[VAL_45:.*]] = gpu.create_csr async {{\[}}%[[VAL_43]]] %[[VAL_6]], %[[VAL_7]], %[[VAL_5]], %[[VAL_14]], %[[VAL_19]], %[[VAL_24]] : memref<?xindex>, memref<?xindex>, memref<?xf64>
	// CHECK: %[[VAL_46:.*]], %[[VAL_47:.*]] = gpu.create_dn_tensor async {{\[}}%[[VAL_45]]] %[[VAL_42]], %[[VAL_31]], %[[VAL_7]], %[[VAL_8]] : index, index into memref<?x?xf64>			// CHECK: %[[VAL_46:.*]], %[[VAL_47:.*]] = gpu.create_dn_tensor async {{\[}}%[[VAL_45]]] %[[VAL_31]], %[[VAL_7]], %[[VAL_8]] : index, index into memref<?x?xf64>
	// CHECK: %[[VAL_48:.*]], %[[VAL_49:.*]] = gpu.create_dn_tensor async {{\[}}%[[VAL_47]]] %[[VAL_42]], %[[VAL_38]], %[[VAL_6]], %[[VAL_8]] : index, index into memref<?x?xf64>			// CHECK: %[[VAL_48:.*]], %[[VAL_49:.*]] = gpu.create_dn_tensor async {{\[}}%[[VAL_47]]] %[[VAL_38]], %[[VAL_6]], %[[VAL_8]] : index, index into memref<?x?xf64>
	// CHECK: %[[VAL_50:.*]], %[[VAL_51:.*]] = gpu.spmm_buffer_size async {{\[}}%[[VAL_49]]] %[[VAL_42]], %[[VAL_44]], %[[VAL_46]], %[[VAL_48]] : index			// CHECK: %[[VAL_50:.*]], %[[VAL_51:.*]] = gpu.spmm_buffer_size async {{\[}}%[[VAL_49]]] %[[VAL_44]], %[[VAL_46]], %[[VAL_48]] : index
	// CHECK: %[[VAL_52:.*]], %[[VAL_53:.*]] = gpu.alloc async {{\[}}%[[VAL_51]]] (%[[VAL_50]]) : memref<?xi8>			// CHECK: %[[VAL_52:.*]], %[[VAL_53:.*]] = gpu.alloc async {{\[}}%[[VAL_51]]] (%[[VAL_50]]) : memref<?xi8>
	// CHECK: %[[VAL_54:.*]] = gpu.spmm async {{\[}}%[[VAL_53]]] %[[VAL_42]], %[[VAL_44]], %[[VAL_46]], %[[VAL_48]], %[[VAL_52]] : memref<?xi8>			// CHECK: %[[VAL_54:.*]] = gpu.spmm async {{\[}}%[[VAL_53]]] %[[VAL_44]], %[[VAL_46]], %[[VAL_48]], %[[VAL_52]] : memref<?xi8>
	// CHECK: %[[VAL_55:.*]] = gpu.destroy_sp_mat async {{\[}}%[[VAL_54]]] %[[VAL_44]]			// CHECK: %[[VAL_55:.*]] = gpu.destroy_sp_mat async {{\[}}%[[VAL_54]]] %[[VAL_44]]
	// CHECK: %[[VAL_56:.*]] = gpu.destroy_dn_tensor async {{\[}}%[[VAL_55]]] %[[VAL_46]]			// CHECK: %[[VAL_56:.*]] = gpu.destroy_dn_tensor async {{\[}}%[[VAL_55]]] %[[VAL_46]]
	// CHECK: %[[VAL_57:.*]] = gpu.destroy_dn_tensor async {{\[}}%[[VAL_56]]] %[[VAL_48]]			// CHECK: %[[VAL_57:.*]] = gpu.destroy_dn_tensor async {{\[}}%[[VAL_56]]] %[[VAL_48]]
	// CHECK: %[[VAL_58:.*]] = gpu.destroy_sparse_env async {{\[}}%[[VAL_57]]] %[[VAL_42]]			// CHECK: %[[VAL_58:.*]] = gpu.destroy_sparse_env async {{\[}}%[[VAL_57]]] CUSPARSE
	// CHECK: %[[VAL_59:.*]] = gpu.dealloc async {{\[}}%[[VAL_58]]] %[[VAL_14]] : memref<?xindex>			// CHECK: %[[VAL_59:.*]] = gpu.dealloc async {{\[}}%[[VAL_58]]] %[[VAL_14]] : memref<?xindex>
	// CHECK: %[[VAL_60:.*]] = gpu.dealloc async {{\[}}%[[VAL_59]]] %[[VAL_19]] : memref<?xindex>			// CHECK: %[[VAL_60:.*]] = gpu.dealloc async {{\[}}%[[VAL_59]]] %[[VAL_19]] : memref<?xindex>
	// CHECK: %[[VAL_61:.*]] = gpu.dealloc async {{\[}}%[[VAL_60]]] %[[VAL_24]] : memref<?xf64>			// CHECK: %[[VAL_61:.*]] = gpu.dealloc async {{\[}}%[[VAL_60]]] %[[VAL_24]] : memref<?xf64>
	// CHECK: %[[VAL_62:.*]] = gpu.dealloc async {{\[}}%[[VAL_61]]] %[[VAL_52]] : memref<?xi8>			// CHECK: %[[VAL_62:.*]] = gpu.dealloc async {{\[}}%[[VAL_61]]] %[[VAL_52]] : memref<?xi8>
	// CHECK: %[[VAL_63:.*]] = gpu.dealloc async {{\[}}%[[VAL_62]]] %[[VAL_31]] : memref<?x?xf64>			// CHECK: %[[VAL_63:.*]] = gpu.dealloc async {{\[}}%[[VAL_62]]] %[[VAL_31]] : memref<?x?xf64>
	// CHECK: %[[VAL_64:.*]] = gpu.memcpy async {{\[}}%[[VAL_63]]] %[[VAL_34]], %[[VAL_38]] : memref<?x?xf64>, memref<?x?xf64>			// CHECK: %[[VAL_64:.*]] = gpu.memcpy async {{\[}}%[[VAL_63]]] %[[VAL_34]], %[[VAL_38]] : memref<?x?xf64>, memref<?x?xf64>
	// CHECK: %[[VAL_65:.*]] = gpu.dealloc async {{\[}}%[[VAL_64]]] %[[VAL_38]] : memref<?x?xf64>			// CHECK: %[[VAL_65:.*]] = gpu.dealloc async {{\[}}%[[VAL_64]]] %[[VAL_38]] : memref<?x?xf64>
	// CHECK: gpu.wait {{\[}}%[[VAL_65]]]			// CHECK: gpu.wait {{\[}}%[[VAL_65]]]
	Show All 9 Lines

mlir/test/Dialect/SparseTensor/GPU/gpu_matvec_lib.mlir

	Show All 37 Lines
	// CHECK: %[[VAL_31:.*]] = gpu.memcpy async {{\[}}%[[VAL_30]]] %[[VAL_29]], %[[VAL_26]] : memref<?xf64>, memref<?xf64>			// CHECK: %[[VAL_31:.*]] = gpu.memcpy async {{\[}}%[[VAL_30]]] %[[VAL_29]], %[[VAL_26]] : memref<?xf64>, memref<?xf64>
	// CHECK: %[[VAL_32:.*]] = bufferization.to_memref %[[VAL_2]] : memref<?xf64>			// CHECK: %[[VAL_32:.*]] = bufferization.to_memref %[[VAL_2]] : memref<?xf64>
	// CHECK: %[[VAL_33:.*]] = gpu.wait async			// CHECK: %[[VAL_33:.*]] = gpu.wait async
	// CHECK: %[[VAL_34:.*]] = memref.dim %[[VAL_32]], %[[VAL_3]] : memref<?xf64>			// CHECK: %[[VAL_34:.*]] = memref.dim %[[VAL_32]], %[[VAL_3]] : memref<?xf64>
	// CHECK: %[[VAL_35:.*]], %[[VAL_36:.*]] = gpu.alloc async {{\[}}%[[VAL_33]]] (%[[VAL_34]]) : memref<?xf64>			// CHECK: %[[VAL_35:.*]], %[[VAL_36:.*]] = gpu.alloc async {{\[}}%[[VAL_33]]] (%[[VAL_34]]) : memref<?xf64>
	// CHECK: %[[VAL_37:.*]] = gpu.memcpy async {{\[}}%[[VAL_36]]] %[[VAL_35]], %[[VAL_32]] : memref<?xf64>, memref<?xf64>			// CHECK: %[[VAL_37:.*]] = gpu.memcpy async {{\[}}%[[VAL_36]]] %[[VAL_35]], %[[VAL_32]] : memref<?xf64>, memref<?xf64>
	// CHECK: gpu.wait {{\[}}%[[VAL_15]], %[[VAL_20]], %[[VAL_25]], %[[VAL_31]], %[[VAL_37]]]			// CHECK: gpu.wait {{\[}}%[[VAL_15]], %[[VAL_20]], %[[VAL_25]], %[[VAL_31]], %[[VAL_37]]]
	// CHECK: %[[VAL_38:.*]] = gpu.wait async			// CHECK: %[[VAL_38:.*]] = gpu.wait async
	// CHECK: %[[VAL_39:.*]], %[[VAL_40:.*]] = gpu.create_sparse_env async {{\[}}%[[VAL_38]]]			// CHECK: %[[VAL_40:.*]] = gpu.create_sparse_env async {{\[}}%[[VAL_38]]] CUSPARSE
	// CHECK: %[[VAL_41:.*]], %[[VAL_42:.*]] = gpu.create_coo async {{\[}}%[[VAL_40]]] %[[VAL_6]], %[[VAL_7]], %[[VAL_5]], %[[VAL_13]], %[[VAL_18]], %[[VAL_23]] : memref<?xindex>, memref<?xindex>, memref<?xf64>			// CHECK: %[[VAL_41:.*]], %[[VAL_42:.*]] = gpu.create_coo async {{\[}}%[[VAL_40]]] %[[VAL_6]], %[[VAL_7]], %[[VAL_5]], %[[VAL_13]], %[[VAL_18]], %[[VAL_23]] : memref<?xindex>, memref<?xindex>, memref<?xf64>
	// CHECK: %[[VAL_43:.*]], %[[VAL_44:.*]] = gpu.create_dn_tensor async {{\[}}%[[VAL_42]]] %[[VAL_39:.*]], %[[VAL_29]], %[[VAL_7]] : index into memref<?xf64>			// CHECK: %[[VAL_43:.*]], %[[VAL_44:.*]] = gpu.create_dn_tensor async {{\[}}%[[VAL_42]]] %[[VAL_29]], %[[VAL_7]] : index into memref<?xf64>
	// CHECK: %[[VAL_45:.*]], %[[VAL_46:.*]] = gpu.create_dn_tensor async {{\[}}%[[VAL_44]]] %[[VAL_39:.*]], %[[VAL_35]], %[[VAL_6]] : index into memref<?xf64>			// CHECK: %[[VAL_45:.*]], %[[VAL_46:.*]] = gpu.create_dn_tensor async {{\[}}%[[VAL_44]]] %[[VAL_35]], %[[VAL_6]] : index into memref<?xf64>
	// CHECK: %[[VAL_47:.*]], %[[VAL_48:.*]] = gpu.spmv_buffer_size async {{\[}}%[[VAL_46]]] %[[VAL_39]], %[[VAL_41]], %[[VAL_43]], %[[VAL_45]]			// CHECK: %[[VAL_47:.*]], %[[VAL_48:.*]] = gpu.spmv_buffer_size async {{\[}}%[[VAL_46]]] %[[VAL_41]], %[[VAL_43]], %[[VAL_45]]
				aartbik: don't
	// CHECK: %[[VAL_49:.*]], %[[VAL_50:.*]] = gpu.alloc async {{\[}}%[[VAL_48]]] (%[[VAL_47]]) : memref<?xi8>			// CHECK: %[[VAL_49:.*]], %[[VAL_50:.*]] = gpu.alloc async {{\[}}%[[VAL_48]]] (%[[VAL_47]]) : memref<?xi8>
	// CHECK: %[[VAL_51:.*]] = gpu.spmv async {{\[}}%[[VAL_50]]] %[[VAL_39]], %[[VAL_41]], %[[VAL_43]], %[[VAL_45]], %[[VAL_49]] : memref<?xi8>			// CHECK: %[[VAL_51:.*]] = gpu.spmv async {{\[}}%[[VAL_50]]] %[[VAL_41]], %[[VAL_43]], %[[VAL_45]], %[[VAL_49]] : memref<?xi8>
	// CHECK: %[[VAL_52:.*]] = gpu.destroy_sp_mat async {{\[}}%[[VAL_51]]] %[[VAL_41]]			// CHECK: %[[VAL_52:.*]] = gpu.destroy_sp_mat async {{\[}}%[[VAL_51]]] %[[VAL_41]]
	// CHECK: %[[VAL_53:.*]] = gpu.destroy_dn_tensor async {{\[}}%[[VAL_52]]] %[[VAL_43]]			// CHECK: %[[VAL_53:.*]] = gpu.destroy_dn_tensor async {{\[}}%[[VAL_52]]] %[[VAL_43]]
	// CHECK: %[[VAL_54:.*]] = gpu.destroy_dn_tensor async {{\[}}%[[VAL_53]]] %[[VAL_45]]			// CHECK: %[[VAL_54:.*]] = gpu.destroy_dn_tensor async {{\[}}%[[VAL_53]]] %[[VAL_45]]
	// CHECK: %[[VAL_55:.*]] = gpu.destroy_sparse_env async {{\[}}%[[VAL_54]]] %[[VAL_39]]			// CHECK: %[[VAL_55:.*]] = gpu.destroy_sparse_env async {{\[}}%[[VAL_54]]] CUSPARSE
	// CHECK: %[[VAL_56:.*]] = gpu.dealloc async {{\[}}%[[VAL_55]]] %[[VAL_13]] : memref<?xindex>			// CHECK: %[[VAL_56:.*]] = gpu.dealloc async {{\[}}%[[VAL_55]]] %[[VAL_13]] : memref<?xindex>
	// CHECK: %[[VAL_57:.*]] = gpu.dealloc async {{\[}}%[[VAL_56]]] %[[VAL_18]] : memref<?xindex>			// CHECK: %[[VAL_57:.*]] = gpu.dealloc async {{\[}}%[[VAL_56]]] %[[VAL_18]] : memref<?xindex>
	// CHECK: %[[VAL_58:.*]] = gpu.dealloc async {{\[}}%[[VAL_57]]] %[[VAL_23]] : memref<?xf64>			// CHECK: %[[VAL_58:.*]] = gpu.dealloc async {{\[}}%[[VAL_57]]] %[[VAL_23]] : memref<?xf64>
	// CHECK: %[[VAL_59:.*]] = gpu.dealloc async {{\[}}%[[VAL_58]]] %[[VAL_49]] : memref<?xi8>			// CHECK: %[[VAL_59:.*]] = gpu.dealloc async {{\[}}%[[VAL_58]]] %[[VAL_49]] : memref<?xi8>
	// CHECK: %[[VAL_60:.*]] = gpu.dealloc async {{\[}}%[[VAL_59]]] %[[VAL_29]] : memref<?xf64>			// CHECK: %[[VAL_60:.*]] = gpu.dealloc async {{\[}}%[[VAL_59]]] %[[VAL_29]] : memref<?xf64>
	// CHECK: %[[VAL_61:.*]] = gpu.memcpy async {{\[}}%[[VAL_60]]] %[[VAL_32]], %[[VAL_35]] : memref<?xf64>, memref<?xf64>			// CHECK: %[[VAL_61:.*]] = gpu.memcpy async {{\[}}%[[VAL_60]]] %[[VAL_32]], %[[VAL_35]] : memref<?xf64>, memref<?xf64>
	// CHECK: %[[VAL_62:.*]] = gpu.dealloc async {{\[}}%[[VAL_61]]] %[[VAL_35]] : memref<?xf64>			// CHECK: %[[VAL_62:.*]] = gpu.dealloc async {{\[}}%[[VAL_61]]] %[[VAL_35]] : memref<?xf64>
	// CHECK: gpu.wait {{\[}}%[[VAL_62]]]			// CHECK: gpu.wait {{\[}}%[[VAL_62]]]

mlir/test/Dialect/SparseTensor/GPU/gpu_sampled_matmul_lib.mlir

	// CHECK: %[[VAL_26:.*]], %[[VAL_27:.*]] = gpu.alloc async {{\[}}%[[VAL_24]]] (%[[VAL_25]]) : memref<?xindex>			// CHECK: %[[VAL_26:.*]], %[[VAL_27:.*]] = gpu.alloc async {{\[}}%[[VAL_24]]] (%[[VAL_25]]) : memref<?xindex>
	// CHECK: %[[VAL_28:.*]] = gpu.memcpy async {{\[}}%[[VAL_27]]] %[[VAL_26]], %[[VAL_17]] : memref<?xindex>, memref<?xindex>			// CHECK: %[[VAL_28:.*]] = gpu.memcpy async {{\[}}%[[VAL_27]]] %[[VAL_26]], %[[VAL_17]] : memref<?xindex>, memref<?xindex>
	// CHECK: %[[VAL_29:.*]] = gpu.wait async			// CHECK: %[[VAL_29:.*]] = gpu.wait async
	// CHECK: %[[VAL_30:.*]] = memref.dim %[[VAL_18]], %[[VAL_4]] : memref<?xf64>			// CHECK: %[[VAL_30:.*]] = memref.dim %[[VAL_18]], %[[VAL_4]] : memref<?xf64>
	// CHECK: %[[VAL_31:.*]], %[[VAL_32:.*]] = gpu.alloc async {{\[}}%[[VAL_29]]] (%[[VAL_30]]) : memref<?xf64>			// CHECK: %[[VAL_31:.*]], %[[VAL_32:.*]] = gpu.alloc async {{\[}}%[[VAL_29]]] (%[[VAL_30]]) : memref<?xf64>
	// CHECK: %[[VAL_33:.*]] = gpu.memcpy async {{\[}}%[[VAL_32]]] %[[VAL_31]], %[[VAL_18]] : memref<?xf64>, memref<?xf64>			// CHECK: %[[VAL_33:.*]] = gpu.memcpy async {{\[}}%[[VAL_32]]] %[[VAL_31]], %[[VAL_18]] : memref<?xf64>, memref<?xf64>
	// CHECK: gpu.wait {{\[}}%[[VAL_10]], %[[VAL_15]], %[[VAL_23]], %[[VAL_28]], %[[VAL_33]]]			// CHECK: gpu.wait {{\[}}%[[VAL_10]], %[[VAL_15]], %[[VAL_23]], %[[VAL_28]], %[[VAL_33]]]
	// CHECK: %[[VAL_34:.*]] = gpu.wait async			// CHECK: %[[VAL_34:.*]] = gpu.wait async
	// CHECK: %[[VAL_35:.*]], %[[VAL_36:.*]] = gpu.create_sparse_env async {{\[}}%[[VAL_34]]]			// CHECK: %[[VAL_36:.*]] = gpu.create_sparse_env async {{\[}}%[[VAL_34]]] CUSPARSE
	// CHECK: %[[VAL_37:.*]], %[[VAL_38:.*]] = gpu.create_dn_tensor async {{\[}}%[[VAL_36]]] %[[VAL_35]], %[[VAL_8]], %[[VAL_3]], %[[VAL_3]] : index, index into memref<8x8xf64>			// CHECK: %[[VAL_37:.*]], %[[VAL_38:.*]] = gpu.create_dn_tensor async {{\[}}%[[VAL_36]]] %[[VAL_8]], %[[VAL_3]], %[[VAL_3]] : index, index into memref<8x8xf64>
	// CHECK: %[[VAL_39:.*]], %[[VAL_40:.*]] = gpu.create_dn_tensor async {{\[}}%[[VAL_38]]] %[[VAL_35]], %[[VAL_13]], %[[VAL_3]], %[[VAL_3]] : index, index into memref<8x8xf64>			// CHECK: %[[VAL_39:.*]], %[[VAL_40:.*]] = gpu.create_dn_tensor async {{\[}}%[[VAL_38]]] %[[VAL_13]], %[[VAL_3]], %[[VAL_3]] : index, index into memref<8x8xf64>
	// CHECK: %[[VAL_41:.*]], %[[VAL_42:.*]] = gpu.create_csr async {{\[}}%[[VAL_40]]] %[[VAL_3]], %[[VAL_3]], %[[VAL_5]], %[[VAL_21]], %[[VAL_26]], %[[VAL_31]] : memref<?xindex>, memref<?xindex>, memref<?xf64>			// CHECK: %[[VAL_41:.*]], %[[VAL_42:.*]] = gpu.create_csr async {{\[}}%[[VAL_40]]] %[[VAL_3]], %[[VAL_3]], %[[VAL_5]], %[[VAL_21]], %[[VAL_26]], %[[VAL_31]] : memref<?xindex>, memref<?xindex>, memref<?xf64>
	// CHECK: %[[VAL_43:.*]], %[[VAL_44:.*]] = gpu.sddmm_buffer_size async {{\[}}%[[VAL_42]]] %[[VAL_35]], %[[VAL_37]], %[[VAL_39]], %[[VAL_41]] into f64			// CHECK: %[[VAL_43:.*]], %[[VAL_44:.*]] = gpu.sddmm_buffer_size async {{\[}}%[[VAL_42]]] %[[VAL_37]], %[[VAL_39]], %[[VAL_41]] into f64
	// CHECK: %[[VAL_45:.*]], %[[VAL_46:.*]] = gpu.alloc async {{\[}}%[[VAL_44]]] (%[[VAL_43]]) : memref<?xi8>			// CHECK: %[[VAL_45:.*]], %[[VAL_46:.*]] = gpu.alloc async {{\[}}%[[VAL_44]]] (%[[VAL_43]]) : memref<?xi8>
	// CHECK: %[[VAL_47:.*]] = gpu.sddmm async {{\[}}%[[VAL_46]]] %[[VAL_35]], %[[VAL_37]], %[[VAL_39]], %[[VAL_41]], %[[VAL_45]] : memref<?xi8> into f64			// CHECK: %[[VAL_47:.*]] = gpu.sddmm async {{\[}}%[[VAL_46]]] %[[VAL_37]], %[[VAL_39]], %[[VAL_41]], %[[VAL_45]] : memref<?xi8> into f64
	// CHECK: %[[VAL_48:.*]] = gpu.destroy_dn_tensor async {{\[}}%[[VAL_47]]] %[[VAL_37]]			// CHECK: %[[VAL_48:.*]] = gpu.destroy_dn_tensor async {{\[}}%[[VAL_47]]] %[[VAL_37]]
	// CHECK: %[[VAL_49:.*]] = gpu.destroy_dn_tensor async {{\[}}%[[VAL_48]]] %[[VAL_39]]			// CHECK: %[[VAL_49:.*]] = gpu.destroy_dn_tensor async {{\[}}%[[VAL_48]]] %[[VAL_39]]
	// CHECK: %[[VAL_50:.*]] = gpu.destroy_sp_mat async {{\[}}%[[VAL_49]]] %[[VAL_41]]			// CHECK: %[[VAL_50:.*]] = gpu.destroy_sp_mat async {{\[}}%[[VAL_49]]] %[[VAL_41]]
	// CHECK: %[[VAL_51:.*]] = gpu.destroy_sparse_env async {{\[}}%[[VAL_50]]] %[[VAL_35]]			// CHECK: %[[VAL_51:.*]] = gpu.destroy_sparse_env async {{\[}}%[[VAL_50]]] CUSPARSE
	// CHECK: %[[VAL_52:.*]] = gpu.dealloc async {{\[}}%[[VAL_51]]] %[[VAL_45]] : memref<?xi8>			// CHECK: %[[VAL_52:.*]] = gpu.dealloc async {{\[}}%[[VAL_51]]] %[[VAL_45]] : memref<?xi8>
	// CHECK: %[[VAL_53:.*]] = gpu.dealloc async {{\[}}%[[VAL_52]]] %[[VAL_8]] : memref<8x8xf64>			// CHECK: %[[VAL_53:.*]] = gpu.dealloc async {{\[}}%[[VAL_52]]] %[[VAL_8]] : memref<8x8xf64>
	// CHECK: %[[VAL_54:.*]] = gpu.dealloc async {{\[}}%[[VAL_53]]] %[[VAL_13]] : memref<8x8xf64>			// CHECK: %[[VAL_54:.*]] = gpu.dealloc async {{\[}}%[[VAL_53]]] %[[VAL_13]] : memref<8x8xf64>
	// CHECK: %[[VAL_55:.*]] = gpu.dealloc async {{\[}}%[[VAL_54]]] %[[VAL_21]] : memref<?xindex>			// CHECK: %[[VAL_55:.*]] = gpu.dealloc async {{\[}}%[[VAL_54]]] %[[VAL_21]] : memref<?xindex>
	// CHECK: %[[VAL_56:.*]] = gpu.dealloc async {{\[}}%[[VAL_55]]] %[[VAL_26]] : memref<?xindex>			// CHECK: %[[VAL_56:.*]] = gpu.dealloc async {{\[}}%[[VAL_55]]] %[[VAL_26]] : memref<?xindex>
	// CHECK: %[[VAL_57:.*]] = gpu.memcpy async {{\[}}%[[VAL_56]]] %[[VAL_18]], %[[VAL_31]] : memref<?xf64>, memref<?xf64>			// CHECK: %[[VAL_57:.*]] = gpu.memcpy async {{\[}}%[[VAL_56]]] %[[VAL_18]], %[[VAL_31]] : memref<?xf64>, memref<?xf64>
	// CHECK: %[[VAL_58:.*]] = gpu.dealloc async {{\[}}%[[VAL_57]]] %[[VAL_31]] : memref<?xf64>			// CHECK: %[[VAL_58:.*]] = gpu.dealloc async {{\[}}%[[VAL_57]]] %[[VAL_31]] : memref<?xf64>
	// CHECK: gpu.wait {{\[}}%[[VAL_58]]]			// CHECK: gpu.wait {{\[}}%[[VAL_58]]]

mlir/test/Integration/Dialect/SparseTensor/GPU/CUDA/sm80-lt/sparse-matmul-2-4-lib.mlir

func.func @sampled_matmul(%a : memref<16x32xf16>,
%c1048576 = arith.constant 1048576 : index		%c1048576 = arith.constant 1048576 : index
%token0 = gpu.wait async		%token0 = gpu.wait async
%d_a, %token1 = gpu.alloc async [%token0] () : memref<16x32xf16>		%d_a, %token1 = gpu.alloc async [%token0] () : memref<16x32xf16>
%d_b, %token2 = gpu.alloc async [%token1] () : memref<32x16xf16>		%d_b, %token2 = gpu.alloc async [%token1] () : memref<32x16xf16>
%d_c, %token3 = gpu.alloc async [%token2] () : memref<16x16xf16>		%d_c, %token3 = gpu.alloc async [%token2] () : memref<16x16xf16>
%token4 = gpu.memcpy async [%token3] %d_a, %a : memref<16x32xf16>, memref<16x32xf16>		%token4 = gpu.memcpy async [%token3] %d_a, %a : memref<16x32xf16>, memref<16x32xf16>
%token5 = gpu.memcpy async [%token4] %d_b, %b : memref<32x16xf16>, memref<32x16xf16>		%token5 = gpu.memcpy async [%token4] %d_b, %b : memref<32x16xf16>, memref<32x16xf16>
%token6 = gpu.memcpy async [%token5] %d_c, %c : memref<16x16xf16>, memref<16x16xf16>		%token6 = gpu.memcpy async [%token5] %d_c, %c : memref<16x16xf16>, memref<16x16xf16>
%env, %token7 = gpu.create_sparse_env async [%token6]		%token7 = gpu.create_sparse_env async [%token6] CUSPARSE
%spmat, %token8 = gpu.create_2to4_spmat async [%token7] %env, %c16, %c32, %d_a: memref<16x32xf16>		%spmat, %token8 = gpu.create_2to4_spmat async [%token7] %c16, %c32, %d_a: memref<16x32xf16>
%dnmat, %token9 = gpu.create_dn_tensor async [%token8] %env, %d_b, %c32, %c16: index, index into memref<32x16xf16>		%dnmat, %token9 = gpu.create_dn_tensor async [%token8] %d_b, %c32, %c16: index, index into memref<32x16xf16>
%dnmat2, %token10 = gpu.create_dn_tensor async [%token9] %env, %d_c, %c16, %c16: index, index into memref<16x16xf16>		%dnmat2, %token10 = gpu.create_dn_tensor async [%token9] %d_c, %c16, %c16: index, index into memref<16x16xf16>
%bufferSz0, %bufferSz1, %bufferSz2, %token11 = gpu.spmm_buffer_size async [%token10] %env, %spmat{NON_TRANSPOSE}, %dnmat{NON_TRANSPOSE}, %dnmat2 : index, index,index into f16		%bufferSz0, %bufferSz1, %bufferSz2, %token11 = gpu.spmm_buffer_size async [%token10] %spmat{NON_TRANSPOSE}, %dnmat{NON_TRANSPOSE}, %dnmat2 : index, index,index into f16
		aartbik: can we move this all the way up to main() just as illustration of what a sparse compiler should generate?
		K-Wu: Good point! I am working on it.
%mem1, %token12 = gpu.alloc async [%token11] (%bufferSz0) : memref<?xf16>		%mem1, %token12 = gpu.alloc async [%token11] (%bufferSz0) : memref<?xf16>
%mem2, %token13 = gpu.alloc async [%token12] (%bufferSz1) : memref<?xf16>		%mem2, %token13 = gpu.alloc async [%token12] (%bufferSz1) : memref<?xf16>
%mem3, %token14 = gpu.alloc async [%token13] (%bufferSz2) : memref<?xf16>		%mem3, %token14 = gpu.alloc async [%token13] (%bufferSz2) : memref<?xf16>
%token15 = gpu.spmm async [%token14] %env, %spmat{NON_TRANSPOSE}, %dnmat{NON_TRANSPOSE}, %dnmat2, %mem1, %mem2, %mem3 : memref<?xf16>, memref<?xf16>,memref<?xf16> into f16		%token15 = gpu.spmm async [%token14] %spmat{NON_TRANSPOSE}, %dnmat{NON_TRANSPOSE}, %dnmat2, %mem1, %mem2, %mem3 : memref<?xf16>, memref<?xf16>,memref<?xf16> into f16
%token16 = gpu.destroy_sp_mat async [%token15] %spmat		%token16 = gpu.destroy_sp_mat async [%token15] %spmat
%token17 = gpu.destroy_dn_tensor async [%token16] %dnmat		%token17 = gpu.destroy_dn_tensor async [%token16] %dnmat
%token18 = gpu.destroy_sparse_env async [%token17] %env		%token18 = gpu.destroy_sparse_env async [%token17] CUSPARSE
%token19 = gpu.memcpy async [%token18] %c, %d_c : memref<16x16xf16>, memref<16x16xf16>		%token19 = gpu.memcpy async [%token18] %c, %d_c : memref<16x16xf16>, memref<16x16xf16>
%token20 = gpu.dealloc async [%token19] %d_c : memref<16x16xf16>		%token20 = gpu.dealloc async [%token19] %d_c : memref<16x16xf16>
%token21 = gpu.dealloc async [%token20] %d_b : memref<32x16xf16>		%token21 = gpu.dealloc async [%token20] %d_b : memref<32x16xf16>
%token22 = gpu.dealloc async [%token21] %d_a : memref<16x32xf16>		%token22 = gpu.dealloc async [%token21] %d_a : memref<16x32xf16>
%token23 = gpu.dealloc async [%token22] %mem3 : memref<?xf16>		%token23 = gpu.dealloc async [%token22] %mem3 : memref<?xf16>
%token24 = gpu.dealloc async [%token23] %mem2 : memref<?xf16>		%token24 = gpu.dealloc async [%token23] %mem2 : memref<?xf16>
%token25 = gpu.dealloc async [%token24] %mem1 : memref<?xf16>		%token25 = gpu.dealloc async [%token24] %mem1 : memref<?xf16>
gpu.wait [%token25]		gpu.wait [%token25]
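
One way to read aartbik's inline suggestion is that the environment is set up and torn down once in the entry point rather than inside each kernel wrapper. The following is only a sketch under that assumption: it reuses the handle-free `gpu.create_sparse_env` / `gpu.destroy_sparse_env` syntax shown on the right-hand side of this diff, the token names are made up for illustration, and the call to `@sampled_matmul` is left as a placeholder comment because its full signature is not shown here. It is not the code that was eventually committed.

```mlir
// Hypothetical main(): create the cuSPARSE environment once, run the
// kernel(s), then destroy the environment once at the end.
func.func @main() {
  %t0 = gpu.wait async
  %t1 = gpu.create_sparse_env async [%t0] CUSPARSE
  gpu.wait [%t1]

  // ... set up inputs and call @sampled_matmul(...) one or more times;
  // under this arrangement the callee no longer creates or destroys
  // the environment itself ...

  %t2 = gpu.wait async
  %t3 = gpu.destroy_sparse_env async [%t2] CUSPARSE
  gpu.wait [%t3]
  return
}
```

With this arrangement the create/destroy pair at `%token7` and `%token18` inside `@sampled_matmul` would be dropped, and the dependency chain there would continue directly from `%token6` and `%token17`.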


Revision Contents

Diff 535568

mlir/include/mlir/Dialect/GPU/IR/GPUOps.td

mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp

mlir/lib/Dialect/SparseTensor/Transforms/SparseGPUCodegen.cpp

mlir/lib/ExecutionEngine/CudaRuntimeWrappers.cpp

mlir/test/Conversion/GPUCommon/lower-2to4-sparse-to-gpu-runtime-calls.mlir

mlir/test/Conversion/GPUCommon/lower-sparse-to-gpu-runtime-calls.mlir

mlir/test/Dialect/GPU/ops.mlir

mlir/test/Dialect/GPU/sparse-roundtrip.mlir

mlir/test/Dialect/SparseTensor/GPU/gpu_matmul_lib.mlir

mlir/test/Dialect/SparseTensor/GPU/gpu_matvec_lib.mlir

mlir/test/Dialect/SparseTensor/GPU/gpu_sampled_matmul_lib.mlir

mlir/test/Integration/Dialect/SparseTensor/GPU/CUDA/sm80-lt/sparse-matmul-2-4-lib.mlir
