Diff 546160

mlir/include/mlir/Dialect/GPU/IR/GPUOps.td

def GPU_CreateCsrOp : GPU_Op<"create_csr", [GPU_AsyncOpInterface]> {

let assemblyFormat = [{		let assemblyFormat = [{
custom<AsyncDependencies>(type($asyncToken), $asyncDependencies)		custom<AsyncDependencies>(type($asyncToken), $asyncDependencies)
$rows `,` $cols `,` $nnz `,` $rowPos `,` $colIdxs `,` $values attr-dict		$rows `,` $cols `,` $nnz `,` $rowPos `,` $colIdxs `,` $values attr-dict
`:` type($rowPos) `,` type($colIdxs) `,` type($values)		`:` type($rowPos) `,` type($colIdxs) `,` type($values)
}];		}];
}		}

		def GPU_Prune2To4SpMatFlag : I32EnumAttr<"Prune2To4SpMatFlag",
		"pruning strategy for 2:4 sparse matrix",
		aartbik: This is a bit nitpicky, but most enums have very short descriptions in this string field, and typically no verb form. So "pruning strategy for 2:4 sparse matrix" would look a bit better here.
		[
		I32EnumAttrCase<"NONE", 0>,
		I32EnumAttrCase<"PRUNE_ONLY", 1>,
		I32EnumAttrCase<"PRUNE_AND_CHECK", 2>,
		]> {
		let genSpecializedAttr = 0;
		aartbik: These `let`s should be indented two less (moved to the left).
		let cppNamespace = GPU_Dialect.cppNamespace;
		}

		def GPU_Prune2To4SpMatFlagAttr : EnumAttr<GPU_Dialect, GPU_Prune2To4SpMatFlag,
		"prune_2to4_spmat_flag">{
		let defaultValue = "Prune2To4SpMatFlag::PRUNE_AND_CHECK";
		}


def GPU_Create2To4SpMatOp : GPU_Op<"create_2to4_spmat", [GPU_AsyncOpInterface]> {		def GPU_Create2To4SpMatOp : GPU_Op<"create_2to4_spmat", [GPU_AsyncOpInterface]> {
let summary = "Create sparse matrix with 2:4 sparsity operation";		let summary = "Create sparse matrix with 2:4 sparsity operation";
let description = [{		let description = [{
The `gpu.create_2to4_spmat` operation initializes a sparse matrix in dense		The `gpu.create_2to4_spmat` operation initializes a sparse matrix in dense
format with 2:4 sparsity.		format with 2:4 sparsity.
The buffers must already be copied from the host to the device prior to		The buffers must already be copied from the host to the device prior to
using this operation. The operation returns a handle to the sparse		using this operation. The operation returns a handle to the sparse
matrix descriptor.		matrix descriptor.

If the `async` keyword is present, the op is executed asynchronously (i.e.		If the `async` keyword is present, the op is executed asynchronously (i.e.
it does not block until the execution has finished on the device). In		it does not block until the execution has finished on the device). In
that case, it returns a !gpu.async.token in addition to the environment.		that case, it returns a !gpu.async.token in addition to the environment.

Example:		Example:

```mlir		```mlir
%spmat, %token = gpu.create_2to4_spmat async [%dep] %rows, %cols, %mem : memref<?xf64>		%spmat, %token = gpu.create_2to4_spmat async [%dep] {PRUNE_AND_CHECK} %rows, %cols, %mem: memref<?xf64>
		aartbik: Nitpick: having the enum flag as a regular operand (viz. comma argument) feels a bit out of place; we typically use different syntax for that. Can we e.g. put it before the [] to make it stand out? `%spmat, %token = gpu.create_2to4_spmat async PRUNE_AND_CHECK [%dep] %rows, %cols, %mem : memref<?xf64>` I am open to other suggestions, but the comma argument seems not quite right.
		wrengr: I agree: since the `$pruneFlag` is an attribute rather than a value, it should either be (a) part of the `attr-dict`, or (b) use some other syntax that helps it stand out from the value arguments. I'm less sure about putting it before the `$asyncDependencies`, just because elsewhere we tend to reserve that spot for the `$asyncToken`; though if we do put the `$pruneFlag` there, I think it should come before the `$asyncToken` so that the `$asyncToken` and `$asyncDependencies` can remain adjacent.
```		```
}];		}];

let arguments = (ins Variadic<GPU_AsyncToken>:$asyncDependencies,		let arguments = (ins Variadic<GPU_AsyncToken>:$asyncDependencies,
Index:$rows,		Index:$rows,
Index:$cols,		Index:$cols,
		GPU_Prune2To4SpMatFlagAttr:$pruneFlag,
AnyMemRef:$memref);		AnyMemRef:$memref);
let results = (outs Res<GPU_SparseSpMatHandle>:$spMat,		let results = (outs Res<GPU_SparseSpMatHandle>:$spMat,
Optional<GPU_AsyncToken>:$asyncToken);		Optional<GPU_AsyncToken>:$asyncToken);

let assemblyFormat = [{		let assemblyFormat = [{
custom<AsyncDependencies>(type($asyncToken), $asyncDependencies)		custom<AsyncDependencies>(type($asyncToken), $asyncDependencies)
$rows `,` $cols `,` $memref attr-dict `:` type($memref)		`{` $pruneFlag `}` $rows `,` $cols `,` $memref attr-dict `:` type($memref)
}];		}];
}		}
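Note on the 2:4 property that the op description above relies on. The following is a minimal C++ sketch, not part of this revision, assuming row-major storage and a column count divisible by four. The names `is2To4` and `prune2To4` are hypothetical. The check verifies that every group of four consecutive values holds at most two non-zeros. The pruning keeps the two largest magnitudes per group, which only approximates the idea behind strip pruning and is not cuSPARSELt's implementation.

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

// Returns true if every group of four consecutive values in each row holds
// at most two non-zeros. Assumes row-major storage and cols % 4 == 0.
bool is2To4(const std::vector<float> &m, std::size_t rows, std::size_t cols) {
  for (std::size_t r = 0; r < rows; ++r) {
    for (std::size_t c = 0; c < cols; c += 4) {
      int nnz = 0;
      for (std::size_t k = 0; k < 4; ++k)
        nnz += (m[r * cols + c + k] != 0.0f) ? 1 : 0;
      if (nnz > 2)
        return false;
    }
  }
  return true;
}

// Illustrative in-place pruning: within each group of four, zero the two
// values of smallest magnitude. Because cols % 4 == 0 is assumed, groups in
// the flattened buffer never straddle a row boundary.
void prune2To4(std::vector<float> &m, std::size_t rows, std::size_t cols) {
  for (std::size_t g = 0; g < rows * cols; g += 4) {
    std::array<std::size_t, 4> idx = {0, 1, 2, 3};
    // Order the four positions by descending magnitude.
    std::sort(idx.begin(), idx.end(), [&](std::size_t a, std::size_t b) {
      return std::fabs(m[g + a]) > std::fabs(m[g + b]);
    });
    m[g + idx[2]] = 0.0f;
    m[g + idx[3]] = 0.0f;
  }
}
```

In terms of the flag added here, `PRUNE_ONLY` would correspond to running only the pruning step, and `PRUNE_AND_CHECK` to running the pruning step followed by the check.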

def GPU_DestroySpMatOp : GPU_Op<"destroy_sp_mat", [GPU_AsyncOpInterface]> {		def GPU_DestroySpMatOp : GPU_Op<"destroy_sp_mat", [GPU_AsyncOpInterface]> {
let summary = "Destroy sparse matrix operation";		let summary = "Destroy sparse matrix operation";
let description = [{		let description = [{
The `gpu.destroy_sp_mat` operation releases all resources of a sparse		The `gpu.destroy_sp_mat` operation releases all resources of a sparse
matrix represented by a handle that was previously created by a		matrix represented by a handle that was previously created by a

mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp

FunctionCallBuilder create2To4SpMatCallBuilder = {
"mgpuCusparseLtCreate2To4SpMat",		"mgpuCusparseLtCreate2To4SpMat",
llvmVoidType,		llvmVoidType,
{llvmPointerType, llvmIntPtrType, llvmIntPtrType, llvmPointerType,		{llvmPointerType, llvmIntPtrType, llvmIntPtrType, llvmPointerType,
llvmInt32Type, llvmPointerType /* void *stream */}};		llvmInt32Type, llvmPointerType /* void *stream */}};
FunctionCallBuilder createCuSparseLtSpMMBufferSizeBuilder = {		FunctionCallBuilder createCuSparseLtSpMMBufferSizeBuilder = {
"mgpuCuSparseLtSpMMBufferSize",		"mgpuCuSparseLtSpMMBufferSize",
llvmVoidType,		llvmVoidType,
{llvmPointerType, llvmInt32Type, llvmInt32Type, llvmPointerType,		{llvmPointerType, llvmInt32Type, llvmInt32Type, llvmPointerType,
llvmPointerType, llvmPointerType, llvmInt32Type,		llvmPointerType, llvmPointerType, llvmInt32Type, llvmInt32Type,
llvmPointerType /*void *stream*/}};		llvmPointerType /*void *stream*/}};
		aartbik: We actually don't document any other flag (except for stream, which is always last). Can we make this `{llvmPointerType, llvmInt32Type, llvmInt32Type, llvmPointerType, llvmPointerType, llvmPointerType, llvmInt32Type, llvmInt32Type, llvmPointerType /*void *stream*/}};` instead? (And yes, perhaps we should document all parameters at one point, but now it feels inconsistent.)
FunctionCallBuilder createCuSparseLtSpMMBuilder = {		FunctionCallBuilder createCuSparseLtSpMMBuilder = {
"mgpuCuSparseLtSpMM",		"mgpuCuSparseLtSpMM",
llvmVoidType,		llvmVoidType,
{llvmPointerType, llvmPointerType, llvmPointerType, llvmPointerType,		{llvmPointerType, llvmPointerType, llvmPointerType, llvmPointerType,
llvmPointerType, llvmPointerType, llvmPointerType /*void *stream*/}};		llvmPointerType, llvmPointerType, llvmPointerType /*void *stream*/}};
};		};

/// A rewrite pattern to convert gpu.host_register operations into a GPU runtime		/// A rewrite pattern to convert gpu.host_register operations into a GPU runtime
static int32_t getCuSparseDataTypeFrom(Type type) {
if (type.isInteger(16))		if (type.isInteger(16))
return 20; // CUDA_R_16I		return 20; // CUDA_R_16I
if (type.isInteger(32))		if (type.isInteger(32))
return 10; // CUDA_R_32I		return 10; // CUDA_R_32I

llvm_unreachable("unsupported element type");		llvm_unreachable("unsupported element type");
}		}

		static gpu::Prune2To4SpMatFlag get2To4PruneFlag(Value spMat) {
		return spMat.getDefiningOp<gpu::Create2To4SpMatOp>().getPruneFlag();
		}
		wrengr: Nit: Why not just combine these into one line?
		aartbik: +1
// TODO: We may want a run-time (of the mlir compiler) disablement/warning:		// TODO: We may want a run-time (of the mlir compiler) disablement/warning:
// cusparseLt currently won't work for cuda architecture <8.0 and will trigger a		// cusparseLt currently won't work for cuda architecture <8.0 and will trigger a
// runtime (of the CUDA program) error , but it might be great if we could at		// runtime (of the CUDA program) error , but it might be great if we could at
// least output a warning when we found the target architecture is <8.0 and the		// least output a warning when we found the target architecture is <8.0 and the
// user still wants to use cusparseLt. to make sure when lowering gpu sparse		// user still wants to use cusparseLt. to make sure when lowering gpu sparse
// dialect to llvm calls, the cusparselt calls are disabled for cuda		// dialect to llvm calls, the cusparselt calls are disabled for cuda
// architecture <8.0		// architecture <8.0
static bool is2To4Sparsity(Value spMat) {		static bool is2To4Sparsity(Value spMat) {
if (failed(areAllLLVMTypes(op, adaptor.getOperands(), rewriter)) ||
failed(isAsyncWithOneDependency(rewriter, op)))		failed(isAsyncWithOneDependency(rewriter, op)))
return failure();		return failure();
Location loc = op.getLoc();		Location loc = op.getLoc();
auto modeA = genConstInt32From(rewriter, loc, adaptor.getModeA());		auto modeA = genConstInt32From(rewriter, loc, adaptor.getModeA());
auto modeB = genConstInt32From(rewriter, loc, adaptor.getModeB());		auto modeB = genConstInt32From(rewriter, loc, adaptor.getModeB());
auto stream = adaptor.getAsyncDependencies().front();		auto stream = adaptor.getAsyncDependencies().front();
Value bufferSize;		Value bufferSize;
if (is2To4Sparsity(op.getSpmatA())) {		if (is2To4Sparsity(op.getSpmatA())) {
		auto prune_flag =
		genConstInt32From(rewriter, loc, get2To4PruneFlag(op.getSpmatA()));
auto computeType = genConstInt32From(		auto computeType = genConstInt32From(
rewriter, loc, getCuSparseLtDataTypeFrom(adaptor.getComputeType()));		rewriter, loc, getCuSparseLtDataTypeFrom(adaptor.getComputeType()));
auto three = rewriter.create<LLVM::ConstantOp>(loc, getIndexType(),		auto three = rewriter.create<LLVM::ConstantOp>(loc, getIndexType(),
rewriter.getIndexAttr(3));		rewriter.getIndexAttr(3));
auto bufferSize = rewriter.create<LLVM::AllocaOp>(		auto bufferSize = rewriter.create<LLVM::AllocaOp>(
loc, llvmInt64PointerType, llvmInt64Type, three, /*alignment=*/16);		loc, llvmInt64PointerType, llvmInt64Type, three, /*alignment=*/16);
createCuSparseLtSpMMBufferSizeBuilder		createCuSparseLtSpMMBufferSizeBuilder
.create(loc, rewriter,		.create(loc, rewriter,
{bufferSize, modeA, modeB, adaptor.getSpmatA(),		{bufferSize, modeA, modeB, adaptor.getSpmatA(),
adaptor.getDnmatB(), adaptor.getDnmatC(), computeType, stream})		adaptor.getDnmatB(), adaptor.getDnmatC(), computeType,
		prune_flag, stream})
		aartbik: See above on parameter order.
.getResult();		.getResult();

auto bufferSizePtr1 = rewriter.create<LLVM::GEPOp>(		auto bufferSizePtr1 = rewriter.create<LLVM::GEPOp>(
loc, llvmInt64PointerType, llvmInt64PointerType, bufferSize,		loc, llvmInt64PointerType, llvmInt64PointerType, bufferSize,
ValueRange{rewriter.create<LLVM::ConstantOp>(		ValueRange{rewriter.create<LLVM::ConstantOp>(
loc, getIndexType(), rewriter.getIndexAttr(1))});		loc, getIndexType(), rewriter.getIndexAttr(1))});
auto bufferSizePtr2 = rewriter.create<LLVM::GEPOp>(		auto bufferSizePtr2 = rewriter.create<LLVM::GEPOp>(
loc, llvmInt64PointerType, llvmInt64PointerType, bufferSize,		loc, llvmInt64PointerType, llvmInt64PointerType, bufferSize,
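Note on the buffer-size hand-off in the lowering above: the `alloca` reserves three `i64` slots, and the runtime wrapper writes its results through that single pointer. A minimal C++ sketch of that convention follows; it is not part of this revision. `fakeSpMMBufferSize`, `SpMMBufferSizes`, and `querySizes` are hypothetical names, and the meaning of the third slot is an assumption, since the code that fills it is elided from this diff.

```cpp
#include <cstdint>

// Hypothetical stand-in for the runtime wrapper. It receives an opaque
// pointer to storage for three int64_t values and fills them in.
void fakeSpMMBufferSize(void *bs, int64_t workspace, int64_t compressed,
                        int64_t compressedBuffer) {
  auto *sizes = reinterpret_cast<int64_t *>(bs);
  sizes[0] = workspace;        // slot 0: workspace size
  sizes[1] = compressed;       // slot 1: compressed size
  sizes[2] = compressedBuffer; // slot 2: assumed; not visible in this diff
}

struct SpMMBufferSizes {
  int64_t workspace;
  int64_t compressed;
  int64_t compressedBuffer;
};

// Caller side: one allocation of three slots, then one read per slot,
// mirroring the alloca and the pointer offsets 1 and 2 in the lowering.
SpMMBufferSizes querySizes(int64_t w, int64_t c, int64_t cb) {
  alignas(16) int64_t slots[3] = {0, 0, 0};
  fakeSpMMBufferSize(slots, w, c, cb);
  return {slots[0], slots[1], slots[2]};
}
```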

mlir/lib/Dialect/SparseTensor/Transforms/SparseGPUCodegen.cpp

static LogicalResult rewrite2To4SpMM(PatternRewriter &rewriter,
Value szn = linalg::createOrFoldDimOp(rewriter, loc, matC, 1);		Value szn = linalg::createOrFoldDimOp(rewriter, loc, matC, 1);

Type indexTp = rewriter.getIndexType();		Type indexTp = rewriter.getIndexType();
Type dnTensorHandleTp = rewriter.getType<gpu::SparseDnTensorHandleType>();		Type dnTensorHandleTp = rewriter.getType<gpu::SparseDnTensorHandleType>();
Type spMatHandleTp = rewriter.getType<gpu::SparseSpMatHandleType>();		Type spMatHandleTp = rewriter.getType<gpu::SparseSpMatHandleType>();
Type tokenTp = rewriter.getType<gpu::AsyncTokenType>();		Type tokenTp = rewriter.getType<gpu::AsyncTokenType>();
Value token = genFirstWait(rewriter, loc);		Value token = genFirstWait(rewriter, loc);
Operation *spGenA = rewriter.create<gpu::Create2To4SpMatOp>(		Operation *spGenA = rewriter.create<gpu::Create2To4SpMatOp>(
loc, spMatHandleTp, tokenTp, token, szm, szk, matA);		loc, spMatHandleTp, tokenTp, token, szm, szk,
		gpu::Prune2To4SpMatFlag::PRUNE_AND_CHECK, matA);

Value spMatA = spGenA->getResult(0);		Value spMatA = spGenA->getResult(0);
token = spGenA->getResult(1);		token = spGenA->getResult(1);
auto dmatB = rewriter.create<gpu::CreateDnTensorOp>(		auto dmatB = rewriter.create<gpu::CreateDnTensorOp>(
loc, dnTensorHandleTp, tokenTp, token, matB,		loc, dnTensorHandleTp, tokenTp, token, matB,
SmallVector<Value>{szk, szn});		SmallVector<Value>{szk, szn});
Value dnB = dmatB.getResult(0);		Value dnB = dmatB.getResult(0);
token = dmatB.getAsyncToken();		token = dmatB.getAsyncToken();

mlir/lib/ExecutionEngine/CudaRuntimeWrappers.cpp

extern "C" MLIR_CUDA_WRAPPERS_EXPORT void		extern "C" MLIR_CUDA_WRAPPERS_EXPORT void
mgpuDestroyCuSparseLtSpMat(void *sh, CUstream /*stream*/) {		mgpuDestroyCuSparseLtSpMat(void *sh, CUstream /*stream*/) {
auto spmat_handle = reinterpret_cast<cusparseLtSpMatHandleAndData *>(sh);		auto spmat_handle = reinterpret_cast<cusparseLtSpMatHandleAndData *>(sh);
CUSPARSE_REPORT_IF_ERROR(cusparseLtMatDescriptorDestroy(&(spmat_handle->mat)))		CUSPARSE_REPORT_IF_ERROR(cusparseLtMatDescriptorDestroy(&(spmat_handle->mat)))
}		}

// Several things are being done in this stage, algorithm selection, planning,		// Several things are being done in this stage, algorithm selection, planning,
// and returning workspace and compressed matrices data buffer sizes.		// and returning workspace and compressed matrices data buffer sizes.
		// The parameter prune_flag is used to indicate whether pruning and pruning
		aartbik: To avoid starting the sentence with lower case, just say "The parameter `prune_flag` is used....". Also, see above on order.
		// check will happen 0 means not prune or prune check, 1 means prune, 2 means
		// prune & prune check
extern "C" MLIR_CUDA_WRAPPERS_EXPORT void		extern "C" MLIR_CUDA_WRAPPERS_EXPORT void
mgpuCuSparseLtSpMMBufferSize(void *bs, int32_t ma, int32_t mb, void *a, void *b,		mgpuCuSparseLtSpMMBufferSize(void *bs, int32_t ma, int32_t mb, void *a, void *b,
void *c, int32_t ctp, CUstream stream) {		void *c, int32_t ctp, int32_t prune_flag,
		CUstream stream) {
assert(cusparseLt_initiated && "client did not call mgpuCreateSparseLtEnv()");		assert(cusparseLt_initiated && "client did not call mgpuCreateSparseLtEnv()");
// TODO: support more advanced settings, e.g., the input right operand is a		// TODO: support more advanced settings, e.g., the input right operand is a
// sparse matrix assuming matA is the sparse matrix		// sparse matrix assuming matA is the sparse matrix
auto matA = reinterpret_cast<cusparseLtSpMatHandleAndData *>(a);		auto matA = reinterpret_cast<cusparseLtSpMatHandleAndData *>(a);
auto matB = reinterpret_cast<cusparseLtDnMatHandleAndData *>(b);		auto matB = reinterpret_cast<cusparseLtDnMatHandleAndData *>(b);
auto matC = reinterpret_cast<cusparseLtDnMatHandleAndData *>(c);		auto matC = reinterpret_cast<cusparseLtDnMatHandleAndData *>(c);
auto workspace_size = reinterpret_cast<int64_t *>(bs);		auto workspace_size = reinterpret_cast<int64_t *>(bs);
auto compressed_size = &(reinterpret_cast<int64_t *>(bs)[1]);		auto compressed_size = &(reinterpret_cast<int64_t *>(bs)[1]);
CUSPARSE_REPORT_IF_ERROR(cusparseLtMatmulAlgSetAttribute(		CUSPARSE_REPORT_IF_ERROR(cusparseLtMatmulAlgSetAttribute(
&cusparseLt_env, &(matA->alg_sel), CUSPARSELT_MATMUL_ALG_CONFIG_ID, &alg,		&cusparseLt_env, &(matA->alg_sel), CUSPARSELT_MATMUL_ALG_CONFIG_ID, &alg,
sizeof(alg)))		sizeof(alg)))

CUSPARSE_REPORT_IF_ERROR(cusparseLtMatmulPlanInit(		CUSPARSE_REPORT_IF_ERROR(cusparseLtMatmulPlanInit(
&cusparseLt_env, &(matA->plan), &(matA->matmul), &(matA->alg_sel)))		&cusparseLt_env, &(matA->plan), &(matA->matmul), &(matA->alg_sel)))

// Pruning step (in-place).		// Pruning step (in-place).
CUSPARSE_REPORT_IF_ERROR(		if (prune_flag > 0)
cusparseLtSpMMAPrune(&cusparseLt_env, &(matA->matmul), matA->values,		CUSPARSE_REPORT_IF_ERROR(cusparseLtSpMMAPrune(
matA->values, CUSPARSELT_PRUNE_SPMMA_STRIP, stream))		&cusparseLt_env, &(matA->matmul), matA->values, matA->values,
		CUSPARSELT_PRUNE_SPMMA_STRIP, stream))
		wrengr: Personally I think this (and the below `if(prune_flag==2)`) would read a lot cleaner if they were phrased in terms of `Prune2To4SpMatFlag` instead of `int32_t`. Fwiw, I know why you did it this way. I ran into the same issue about sharing enums between the compiler and the executionengine for the sparse-tensor runtime library, and fixing that takes a fair deal of boilerplate. Nevertheless it's unfortunate that tblgen'ed enums run into this problem. I'm not suggesting you change it for this CL; rather, going forward you may want to consider factoring out an enums-library to be shared between the compiler and executionengine, since I expect the CUDA wrappers have a whole bunch of this sort of thing. If you want to do that, drop me a line; there are a number of gotchas on the cmake side of things, which I worked out for the sparse-tensor RT and can factor out into a reusable cmake-function.
		aartbik: +1 on the sentiment, but also on leaving this for a later CL.
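Note on the typed-enum idea discussed above. The following is a minimal C++ sketch, not part of this revision, of how the runtime side could decode the raw `int32_t` into named cases. The enum here is a hand-written, hypothetical mirror of the TableGen definition (values copied from the `I32EnumAttrCase` entries in this diff), and `PruneSteps` and `decodePruneFlag` are invented for illustration.

```cpp
#include <cstdint>

// Hypothetical hand-written mirror of the TableGen enum; the values match
// the I32EnumAttrCase entries shown in GPUOps.td in this diff.
enum class Prune2To4SpMatFlag : int32_t {
  NONE = 0,
  PRUNE_ONLY = 1,
  PRUNE_AND_CHECK = 2,
};

// Which of the two optional runtime steps should run.
struct PruneSteps {
  bool prune;
  bool check;
};

// Decode the raw integer passed across the C ABI into named steps.
PruneSteps decodePruneFlag(int32_t raw) {
  auto flag = static_cast<Prune2To4SpMatFlag>(raw);
  bool prune = flag == Prune2To4SpMatFlag::PRUNE_ONLY ||
               flag == Prune2To4SpMatFlag::PRUNE_AND_CHECK;
  bool check = flag == Prune2To4SpMatFlag::PRUNE_AND_CHECK;
  return {prune, check};
}
```

One behavioral difference from the wrapper in this diff: the wrapper prunes for any `prune_flag > 0`, whereas this sketch treats values outside the three named cases as doing nothing.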

// Check structure of A.		// Check structure of A.
// Note that this adds a synchronization on the stream.		// Note that this adds a synchronization on the stream.
// TODO: Do we want that?		// TODO: Do we want that?
		if (prune_flag == 2) {
int *dvalid = (int *)mgpuMemAlloc(sizeof(int), stream);		int *dvalid = (int *)mgpuMemAlloc(sizeof(int), stream);
CUSPARSE_REPORT_IF_ERROR(cusparseLtSpMMAPruneCheck(		CUSPARSE_REPORT_IF_ERROR(cusparseLtSpMMAPruneCheck(
&cusparseLt_env, &(matA->matmul), matA->values, dvalid, stream))		&cusparseLt_env, &(matA->matmul), matA->values, dvalid, stream))
int valid = 0;		int valid = 0;
mgpuMemcpy(&valid, dvalid, sizeof(int), stream);		mgpuMemcpy(&valid, dvalid, sizeof(int), stream);
mgpuStreamSynchronize(stream);		mgpuStreamSynchronize(stream);
mgpuMemFree(dvalid, stream);		mgpuMemFree(dvalid, stream);
if (valid != 0)		if (valid != 0)
fprintf(stderr, "CUPARSE-LT: sparse matrix is not 2:4; computed results "		fprintf(stderr, "CUPARSE-LT: sparse matrix is not 2:4; computed results "
"will be invalid\n");		"will be invalid\n");
		}

CUSPARSE_REPORT_IF_ERROR(cusparseLtMatmulGetWorkspace(		CUSPARSE_REPORT_IF_ERROR(cusparseLtMatmulGetWorkspace(
&cusparseLt_env, &(matA->plan), &workspace_size_))		&cusparseLt_env, &(matA->plan), &workspace_size_))
CUSPARSE_REPORT_IF_ERROR(cusparseLtSpMMACompressedSize(		CUSPARSE_REPORT_IF_ERROR(cusparseLtSpMMACompressedSize(
&cusparseLt_env, &(matA->plan), &compressed_size_,		&cusparseLt_env, &(matA->plan), &compressed_size_,
&compressed_buffer_size_))		&compressed_buffer_size_))

// Avoid zero-allocation.		// Avoid zero-allocation.

mlir/test/Conversion/GPUCommon/lower-2to4-sparse-to-gpu-runtime-calls.mlir

module attributes {gpu.container_module} {
// CHECK: llvm.call @mgpuDestroyCuSparseLtSpMat		// CHECK: llvm.call @mgpuDestroyCuSparseLtSpMat
// CHECK: llvm.call @mgpuDestroyCuSparseLtDnMat		// CHECK: llvm.call @mgpuDestroyCuSparseLtDnMat
// CHECK: llvm.call @mgpuStreamSynchronize		// CHECK: llvm.call @mgpuStreamSynchronize
// CHECK: llvm.call @mgpuStreamDestroy		// CHECK: llvm.call @mgpuStreamDestroy
func.func @matmul(%arg0: index) {		func.func @matmul(%arg0: index) {
%token0 = gpu.wait async		%token0 = gpu.wait async
%mem1, %token1 = gpu.alloc async [%token0] (%arg0) : memref<?xf16>		%mem1, %token1 = gpu.alloc async [%token0] (%arg0) : memref<?xf16>
%mem2, %token2 = gpu.alloc async [%token1] (%arg0) : memref<?xf16>		%mem2, %token2 = gpu.alloc async [%token1] (%arg0) : memref<?xf16>
%spmat, %token4 = gpu.create_2to4_spmat async [%token2] %arg0, %arg0, %mem1: memref<?xf16>		%spmat, %token4 = gpu.create_2to4_spmat async [%token2] {PRUNE_AND_CHECK} %arg0, %arg0, %mem1: memref<?xf16>
%dnmat, %token5 = gpu.create_dn_tensor async [%token4] %mem2, %arg0, %arg0 : index, index into memref<?xf16>		%dnmat, %token5 = gpu.create_dn_tensor async [%token4] %mem2, %arg0, %arg0 : index, index into memref<?xf16>
%bufferSz0, %bufferSz1, %bufferSz2, %token6 = gpu.spmm_buffer_size async [%token5] %spmat, %dnmat, %dnmat : index,index,index into f16		%bufferSz0, %bufferSz1, %bufferSz2, %token6 = gpu.spmm_buffer_size async [%token5] %spmat, %dnmat, %dnmat : index,index,index into f16
%token7 = gpu.spmm async [%token6] %spmat, %dnmat, %dnmat, %mem2, %mem2, %mem2 : memref<?xf16>,memref<?xf16>,memref<?xf16> into f16		%token7 = gpu.spmm async [%token6] %spmat, %dnmat, %dnmat, %mem2, %mem2, %mem2 : memref<?xf16>,memref<?xf16>,memref<?xf16> into f16
%token8 = gpu.destroy_sp_mat async [%token7] %spmat		%token8 = gpu.destroy_sp_mat async [%token7] %spmat
%token9 = gpu.destroy_dn_tensor async [%token8] %dnmat		%token9 = gpu.destroy_dn_tensor async [%token8] %dnmat
gpu.wait [%token9]		gpu.wait [%token9]
return		return
}		}

}		}

mlir/test/Dialect/SparseTensor/GPU/gpu_matmul_lib_2to4.mlir

// CHECK: %[[VAL_22:.*]] = memref.dim %[[VAL_19]], %[[VAL_4]] : memref<?x?xf16>		// CHECK: %[[VAL_22:.*]] = memref.dim %[[VAL_19]], %[[VAL_4]] : memref<?x?xf16>
// CHECK: %[[VAL_23:.*]], %[[VAL_24:.*]] = gpu.alloc async {{\[}}%[[VAL_20]]] (%[[VAL_21]], %[[VAL_22]]) : memref<?x?xf16>		// CHECK: %[[VAL_23:.*]], %[[VAL_24:.*]] = gpu.alloc async {{\[}}%[[VAL_20]]] (%[[VAL_21]], %[[VAL_22]]) : memref<?x?xf16>
// CHECK: %[[VAL_25:.*]] = gpu.memcpy async {{\[}}%[[VAL_24]]] %[[VAL_23]], %[[VAL_19]] : memref<?x?xf16>, memref<?x?xf16>		// CHECK: %[[VAL_25:.*]] = gpu.memcpy async {{\[}}%[[VAL_24]]] %[[VAL_23]], %[[VAL_19]] : memref<?x?xf16>, memref<?x?xf16>
// CHECK: gpu.wait {{\[}}%[[VAL_11]], %[[VAL_18]], %[[VAL_25]]]		// CHECK: gpu.wait {{\[}}%[[VAL_11]], %[[VAL_18]], %[[VAL_25]]]
// CHECK: %[[VAL_26:.*]] = memref.dim %[[VAL_9]], %[[VAL_3]] : memref<?x?xf16>		// CHECK: %[[VAL_26:.*]] = memref.dim %[[VAL_9]], %[[VAL_3]] : memref<?x?xf16>
// CHECK: %[[VAL_27:.*]] = memref.dim %[[VAL_16]], %[[VAL_3]] : memref<?x?xf16>		// CHECK: %[[VAL_27:.*]] = memref.dim %[[VAL_16]], %[[VAL_3]] : memref<?x?xf16>
// CHECK: %[[VAL_28:.*]] = memref.dim %[[VAL_23]], %[[VAL_4]] : memref<?x?xf16>		// CHECK: %[[VAL_28:.*]] = memref.dim %[[VAL_23]], %[[VAL_4]] : memref<?x?xf16>
// CHECK: %[[VAL_29:.*]] = gpu.wait async		// CHECK: %[[VAL_29:.*]] = gpu.wait async
// CHECK: %[[VAL_30:.*]], %[[VAL_31:.*]] = gpu.create_2to4_spmat async {{\[}}%[[VAL_29]]] %[[VAL_26]], %[[VAL_27]], %[[VAL_9]] : memref<?x?xf16>		// CHECK: %[[VAL_30:.*]], %[[VAL_31:.*]] = gpu.create_2to4_spmat async {{\[}}%[[VAL_29]]]{{{.*}}} %[[VAL_26]], %[[VAL_27]], %[[VAL_9]] : memref<?x?xf16>
// CHECK: %[[VAL_32:.*]], %[[VAL_33:.*]] = gpu.create_dn_tensor async {{\[}}%[[VAL_31]]] %[[VAL_16]], %[[VAL_27]], %[[VAL_28]] : index, index into memref<?x?xf16>		// CHECK: %[[VAL_32:.*]], %[[VAL_33:.*]] = gpu.create_dn_tensor async {{\[}}%[[VAL_31]]] %[[VAL_16]], %[[VAL_27]], %[[VAL_28]] : index, index into memref<?x?xf16>
// CHECK: %[[VAL_34:.*]], %[[VAL_35:.*]] = gpu.create_dn_tensor async {{\[}}%[[VAL_33]]] %[[VAL_23]], %[[VAL_26]], %[[VAL_28]] : index, index into memref<?x?xf16>		// CHECK: %[[VAL_34:.*]], %[[VAL_35:.*]] = gpu.create_dn_tensor async {{\[}}%[[VAL_33]]] %[[VAL_23]], %[[VAL_26]], %[[VAL_28]] : index, index into memref<?x?xf16>
// CHECK: %[[VAL_36:.*]]:3, %[[VAL_37:.*]] = gpu.spmm_buffer_size async {{\[}}%[[VAL_35]]] %[[VAL_30]], %[[VAL_32]], %[[VAL_34]] : index, index, index into f16		// CHECK: %[[VAL_36:.*]]:3, %[[VAL_37:.*]] = gpu.spmm_buffer_size async {{\[}}%[[VAL_35]]] %[[VAL_30]], %[[VAL_32]], %[[VAL_34]] : index, index, index into f16
// CHECK: %[[VAL_38:.*]], %[[VAL_39:.*]] = gpu.alloc async {{\[}}%[[VAL_37]]] (%[[VAL_36]]#0) : memref<?xi8>		// CHECK: %[[VAL_38:.*]], %[[VAL_39:.*]] = gpu.alloc async {{\[}}%[[VAL_37]]] (%[[VAL_36]]#0) : memref<?xi8>
// CHECK: %[[VAL_40:.*]], %[[VAL_41:.*]] = gpu.alloc async {{\[}}%[[VAL_39]]] (%[[VAL_36]]#1) : memref<?xi8>		// CHECK: %[[VAL_40:.*]], %[[VAL_41:.*]] = gpu.alloc async {{\[}}%[[VAL_39]]] (%[[VAL_36]]#1) : memref<?xi8>
// CHECK: %[[VAL_42:.*]], %[[VAL_43:.*]] = gpu.alloc async {{\[}}%[[VAL_41]]] (%[[VAL_36]]#2) : memref<?xi8>		// CHECK: %[[VAL_42:.*]], %[[VAL_43:.*]] = gpu.alloc async {{\[}}%[[VAL_41]]] (%[[VAL_36]]#2) : memref<?xi8>
// CHECK: %[[VAL_44:.*]] = gpu.spmm async {{\[}}%[[VAL_43]]] %[[VAL_30]], %[[VAL_32]], %[[VAL_34]], %[[VAL_38]], %[[VAL_40]], %[[VAL_42]] : memref<?xi8>, memref<?xi8>, memref<?xi8> into f16		// CHECK: %[[VAL_44:.*]] = gpu.spmm async {{\[}}%[[VAL_43]]] %[[VAL_30]], %[[VAL_32]], %[[VAL_34]], %[[VAL_38]], %[[VAL_40]], %[[VAL_42]] : memref<?xi8>, memref<?xi8>, memref<?xi8> into f16
// CHECK: %[[VAL_45:.*]] = gpu.destroy_sp_mat async {{\[}}%[[VAL_44]]] %[[VAL_30]]		// CHECK: %[[VAL_45:.*]] = gpu.destroy_sp_mat async {{\[}}%[[VAL_44]]] %[[VAL_30]]
func.func @matmul(%arg0: tensor<?x?xf16>, %arg1: tensor<?x?xf16>, %arg2: tensor<?x?xf16>) -> tensor<?x?xf16> {
^bb0(%in: f16, %in_0: f16, %out: f16):		^bb0(%in: f16, %in_0: f16, %out: f16):
%1 = arith.mulf %in, %in_0 : f16		%1 = arith.mulf %in, %in_0 : f16
%2 = arith.addf %out, %1 : f16		%2 = arith.addf %out, %1 : f16
linalg.yield %2 : f16		linalg.yield %2 : f16
} -> tensor<?x?xf16>		} -> tensor<?x?xf16>
return %0 : tensor<?x?xf16>		return %0 : tensor<?x?xf16>
}		}
}		}
No newline at end of file		No newline at end of file

mlir/test/Integration/Dialect/SparseTensor/GPU/CUDA/sm80-lt/sparse-matmul-2-4-lib.mlir

func.func @sampled_matmul(%a : memref<16x32xf16>,
%c1048576 = arith.constant 1048576 : index		%c1048576 = arith.constant 1048576 : index
%token0 = gpu.wait async		%token0 = gpu.wait async
%d_a, %token1 = gpu.alloc async [%token0] () : memref<16x32xf16>		%d_a, %token1 = gpu.alloc async [%token0] () : memref<16x32xf16>
%d_b, %token2 = gpu.alloc async [%token1] () : memref<32x16xf16>		%d_b, %token2 = gpu.alloc async [%token1] () : memref<32x16xf16>
%d_c, %token3 = gpu.alloc async [%token2] () : memref<16x16xf16>		%d_c, %token3 = gpu.alloc async [%token2] () : memref<16x16xf16>
%token4 = gpu.memcpy async [%token3] %d_a, %a : memref<16x32xf16>, memref<16x32xf16>		%token4 = gpu.memcpy async [%token3] %d_a, %a : memref<16x32xf16>, memref<16x32xf16>
%token5 = gpu.memcpy async [%token4] %d_b, %b : memref<32x16xf16>, memref<32x16xf16>		%token5 = gpu.memcpy async [%token4] %d_b, %b : memref<32x16xf16>, memref<32x16xf16>
%token6 = gpu.memcpy async [%token5] %d_c, %c : memref<16x16xf16>, memref<16x16xf16>		%token6 = gpu.memcpy async [%token5] %d_c, %c : memref<16x16xf16>, memref<16x16xf16>
%spmat, %token8 = gpu.create_2to4_spmat async [%token6] %c16, %c32, %d_a: memref<16x32xf16>		%spmat, %token8 = gpu.create_2to4_spmat async [%token6]{PRUNE_AND_CHECK} %c16, %c32, %d_a: memref<16x32xf16>
%dnmat, %token9 = gpu.create_dn_tensor async [%token8] %d_b, %c32, %c16: index, index into memref<32x16xf16>		%dnmat, %token9 = gpu.create_dn_tensor async [%token8] %d_b, %c32, %c16: index, index into memref<32x16xf16>
%dnmat2, %token10 = gpu.create_dn_tensor async [%token9] %d_c, %c16, %c16: index, index into memref<16x16xf16>		%dnmat2, %token10 = gpu.create_dn_tensor async [%token9] %d_c, %c16, %c16: index, index into memref<16x16xf16>
%bufferSz0, %bufferSz1, %bufferSz2, %token11 = gpu.spmm_buffer_size async [%token10] %spmat{NON_TRANSPOSE}, %dnmat{NON_TRANSPOSE}, %dnmat2 : index, index,index into f16		%bufferSz0, %bufferSz1, %bufferSz2, %token11 = gpu.spmm_buffer_size async [%token10] %spmat{NON_TRANSPOSE}, %dnmat{NON_TRANSPOSE}, %dnmat2 : index, index,index into f16
%mem1, %token12 = gpu.alloc async [%token11] (%bufferSz0) : memref<?xf16>		%mem1, %token12 = gpu.alloc async [%token11] (%bufferSz0) : memref<?xf16>
%mem2, %token13 = gpu.alloc async [%token12] (%bufferSz1) : memref<?xf16>		%mem2, %token13 = gpu.alloc async [%token12] (%bufferSz1) : memref<?xf16>
%mem3, %token14 = gpu.alloc async [%token13] (%bufferSz2) : memref<?xf16>		%mem3, %token14 = gpu.alloc async [%token13] (%bufferSz2) : memref<?xf16>
%token15 = gpu.spmm async [%token14] %spmat{NON_TRANSPOSE}, %dnmat{NON_TRANSPOSE}, %dnmat2, %mem1, %mem2, %mem3 : memref<?xf16>, memref<?xf16>,memref<?xf16> into f16		%token15 = gpu.spmm async [%token14] %spmat{NON_TRANSPOSE}, %dnmat{NON_TRANSPOSE}, %dnmat2, %mem1, %mem2, %mem3 : memref<?xf16>, memref<?xf16>,memref<?xf16> into f16
%token16 = gpu.destroy_sp_mat async [%token15] %spmat		%token16 = gpu.destroy_sp_mat async [%token15] %spmat

This is an archive of the discontinued LLVM Phabricator instance.

[mlir][sparse][gpu] add 2:4 spmm prune_and_check flag
ClosedPublic

