Details
Diff Detail
- Repository: rG LLVM Github Monorepo
Event Timeline
mlir/test/Dialect/SparseTensor/GPU/gpu_sampled_matmul_lib.mlir
- Line 88: Can we instead output the #CSR type here?
mlir/lib/Dialect/SparseTensor/Transforms/SparseGPUCodegen.cpp
- Line 879: Rewrite rule here.
mlir/test/Integration/Dialect/SparseTensor/GPU/CUDA/sparse-sampled-matmul-lib.mlir
- Line 43: TODO: add checking logic.
mlir/test/Integration/Dialect/SparseTensor/GPU/CUDA/sparse-sampled-matmul-lib.mlir
- Line 43: Didn't notice there is already checking logic down below. Closing this.
mlir/test/Dialect/SparseTensor/GPU/gpu_sampled_matmul_lib.mlir
- Line 98:

  ```mlir
  %1 = sparse_tensor_in_place_result %args : tensor<8x8xf64, #SM>  // something new, I have to think about it
  %2 = linalg.generic #trait_sampled_dense_dense
    ins(%args, %arga, %argb : tensor<8x8xf64, #SM>, tensor<8x8xf64>, tensor<8x8xf64>)
    outs(%1 : tensor<8x8xf64, #SM>) {
    ^bb(%s: f64, %a: f64, %b: f64, %x: f64):
      %p = arith.mulf %a, %b : f64
      %q = arith.mulf %s, %p : f64
      %r = arith.addf %x, %q : f64
      linalg.yield %r : f64
  } -> tensor<8x8xf64, #SM>
  return %2 : tensor<8x8xf64, #SM>
  ```
mlir/lib/Dialect/SparseTensor/Transforms/SparseGPUCodegen.cpp
- Line 623: We test that C is sparse, but there is no test at all for D. So the code below will sometimes crash, since it uses information on C, not D.
- Line 628: This comment says more than is done below; is this a note to self, or something else?
- Line 632: You miss the `d : row` here.
- Line 644: This assumes that D already has exactly the same pattern as C (or, more precisely, the right length for all the arrays into which we copy the result). Perhaps we should start with simply returning C again after this operation, and find a way to allow this at the IR level.
- Line 723: This is the wrong replacement; we need `rewriter.replaceOpWithNewOp<sparse_tensor::LoadOp>(op, ....)`, and probably without inserts. A rough sketch is given after this list.
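A minimal sketch of what the Line 723 suggestion could look like, assuming the op being rewritten is the `linalg.generic` and its init operand is the sparse sampling matrix; the exact `LoadOp` builder overload is an assumption on my side, not taken from the patch:

```cpp
// Hedged sketch only: rematerialize the sparse output through
// sparse_tensor.load rather than returning the tensor value directly.
// `op` and `rewriter` are the rewrite pattern's arguments; "without
// inserts" means we do not set the hasInserts flag on the load.
Value spMat = op.getDpsInitOperand(0)->get();
rewriter.replaceOpWithNewOp<sparse_tensor::LoadOp>(op, spMat);
```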
mlir/test/Dialect/SparseTensor/GPU/gpu_sampled_matmul_lib.mlir
- Line 111: Can you remove this part with the reuse completely? I think we will just need the pattern above for proper testing.
mlir/test/Integration/Dialect/SparseTensor/GPU/CUDA/sparse-sampled-matmul-lib.mlir
- Line 47: Dense output does not go through the GPU kernel then, right?
mlir/lib/Dialect/SparseTensor/Transforms/SparseGPUCodegen.cpp
- Line 617: Please fix the 80-column breaks in this comment.
- Line 619: I think we need an `if (op.getDpsInitOperand(0) != c)` check here to enforce the ad-hoc pattern.
mlir/test/Dialect/SparseTensor/GPU/gpu_sampled_matmul_lib.mlir
- Line 105: Perhaps add "TODO: find a cleaner way to express this".
mlir/test/Integration/Dialect/SparseTensor/GPU/CUDA/sparse-sampled-matmul-lib.mlir
- Line 48: Put the comment about the ad-hoc idiom here too, so we don't forget.
- Line 93: Just curious: do you really want to use the from-file setup of the sampling matrix? We can also just create it in memory like so many other examples, right?
mlir/lib/Dialect/SparseTensor/Transforms/SparseGPUCodegen.cpp
- Line 619: Or, probably more precisely: after preprocessing we have `%0 = bufferization.alloc_tensor() copy(%arg0)`, so `op.getDpsInitOperand(0).getDefiningOp()` links to `%0 = bufferization.alloc_tensor() copy(%c)`. See the sketch after this list.
- Line 723: Make sure to return `sparse_tensor.load %arg0` instead of just `%arg0`, since the rematerialization needs to be made explicit.
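A hedged sketch of the defining-op check described above, assuming the sampling matrix `c` has already been identified and the preprocessing step introduces the `bufferization.alloc_tensor ... copy` op; accessor spellings such as `getCopy()` are my assumptions about the bufferization dialect API, not code from the patch:

```cpp
// Hypothetical sketch: accept the ad-hoc SDDMM idiom only when the init
// operand of the linalg.generic is an alloc_tensor copy of the sampling
// matrix C, i.e. the sparse output is updated in place.
Value init = op.getDpsInitOperand(0)->get();
auto alloc = init.getDefiningOp<bufferization::AllocTensorOp>();
if (!alloc || alloc.getCopy() != c)
  return failure(); // not the recognized idiom; leave the op untouched
```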
mlir/test/Integration/Dialect/SparseTensor/GPU/CUDA/sparse-sampled-matmul-lib.mlir
- Line 93: Okay, I will incorporate it next time.
mlir/lib/Dialect/SparseTensor/Transforms/SparseGPUCodegen.cpp
- Lines 334–351: The nesting is too deep here. Could you either try to break it into more functions or use early returns on some simple conditions? E.g., `if (!isa_and_nonnull<...>(def)) return false;` and only then `Value s_out = ...;` and so on. A fuller sketch of the style follows this list.
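A generic illustration of the early-return style being requested; the helper name, the ops checked, and the conditions are placeholders, not the actual matcher in this change:

```cpp
// Hedged sketch: each disqualifying condition exits immediately, so the
// remaining checks stay at a single indentation level instead of nesting.
static bool isAdmissibleSampledPattern(linalg::GenericOp op) {
  Operation *def = op.getDpsInitOperand(0)->get().getDefiningOp();
  if (!isa_and_nonnull<bufferization::AllocTensorOp>(def))
    return false;
  Value sOut = def->getResult(0);
  // Placeholder condition: the sampled output should have a single use
  // inside the kernel being matched.
  if (!sOut.hasOneUse())
    return false;
  return true;
}
```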
mlir/test/Integration/Dialect/SparseTensor/GPU/CUDA/sparse-sampled-matmul-lib.mlir
- Line 48: We can now use https://reviews.llvm.org/D152969
mlir/lib/ExecutionEngine/CudaRuntimeWrappers.cpp
- Line 274 (On Diff #531523): Actually, with the new formulation you no longer need this, since this is the proper S(I,J) = SPY(S(I,J)) x SUM_K A(I,K) B(K,J) * alpha + beta * S(I,J).
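Written out as a formula (my transcription of the comment, with spy(S) denoting the sparsity pattern of S used as a sampling mask):

```latex
S_{ij} \;=\; \alpha \,\operatorname{spy}(S_{ij}) \sum_{k} A_{ik}\,B_{kj} \;+\; \beta\, S_{ij}
```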
mlir/lib/Dialect/SparseTensor/Transforms/SparseGPUCodegen.cpp
- Lines 334–351: Good point and example! Incorporated.
mlir/lib/Dialect/SparseTensor/Transforms/SparseGPUCodegen.cpp
- Line 335: You need to do a null check too.
mlir/lib/Dialect/SparseTensor/Transforms/SparseGPUCodegen.cpp
- Line 332: Add a comment: `// Helper to detect ...`
- Line 344: `return s_out == use->getOperand(0);`?
- Line 348: This seems way too simple. You pretty much match any `%1 = sparse_tensor.unary`, but you don't check that %1 produces A * B and that %2 produces the sum accumulation. It will match the current example, but also much more! Please fix.
- Line 641: Remove this TODO. The current idiom has in-place semantics that will be dealt with correctly by bufferization.
mlir/test/Integration/Dialect/SparseTensor/GPU/CUDA/sparse-sampled-matmul-lib.mlir
- Line 116: Releasing %0 feels cleaner, so that the in-place behavior is clear.
mlir/lib/Dialect/SparseTensor/Transforms/SparseGPUCodegen.cpp
- Line 335: You will often find the idiom `if (Operation *def = ....getDefiningOp()) { /* use def */ }` for this purpose, which you can even combine with the isa as `if (auto def = ....getDefiningOp<...>())`. A small illustration follows this list.
- Line 353: Technically, you have not checked whether the unary feeds into the reduce, right?
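A short illustration of that idiom; the value and op names here are placeholders chosen for the example, not code from the patch:

```cpp
// Hedged sketch: getDefiningOp() returns null for block arguments, so the
// if-condition doubles as the null check.
if (Operation *def = val.getDefiningOp()) {
  // `def` is the op that produced `val`; safe to inspect here.
}

// Combined with the type test: the templated form returns a typed handle
// that is null unless `val` is produced by that op kind.
if (auto alloc = val.getDefiningOp<bufferization::AllocTensorOp>()) {
  // `alloc` is a non-null AllocTensorOp here.
}
```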
mlir/lib/Dialect/SparseTensor/Transforms/SparseGPUCodegen.cpp
- Line 344: This is checked two times in the loop, so I don't find an easy way to simplify the logic.