This is an archive of the discontinued LLVM Phabricator instance.

Differential D148682

[mlir][sparse][gpu] generate proper memcpy in/out host and device
ClosedPublic

Authored by aartbik on Apr 18 2023, 8:57 PM.

Details

Reviewers

nicolasvasilache
herhut
bixia
Peiming
wrengr
ThomasRaoux
christopherbate
ezhulenev
george.karpenkov
K-Wu

Commits

rG86888e420c41: [mlir][sparse][gpu] generate proper memcpy in/out host and device

Summary

Host registration is a convenient way to get CUDA kernels
running, but it may be slow and does not work for all buffers
(such as global constants). This revision instead uses proper
alloc/copy/dealloc chains for buffers, chained asynchronously
to increase overlap. The host registration mechanism is
kept under a flag for the output buffer, purely for
experimentation while this project ramps up.
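
To make the chain structure in the summary concrete, here is a minimal sketch that models the host-side op sequence as plain strings. It does not use the MLIR API: `emitHostOps` is a hypothetical helper introduced only for illustration, and the ordering it produces is an assumption read off the CHECK lines of the `gpu_matmul.mlir` test in this revision (buffer 0 is assumed to be the output, as in `genParametersIn`).

```cpp
#include <string>
#include <vector>

// Hypothetical string model of the host-side ops this revision emits.
// This is NOT the MLIR builder API; it only mirrors the op ordering.
// Buffer 0 is assumed to be the output buffer.
std::vector<std::string> emitHostOps(unsigned numBuffers) {
  std::vector<std::string> ops;
  // Parameters in: each buffer gets its own independent chain
  // (first wait -> alloc -> copy host-to-device), so chains may overlap.
  for (unsigned i = 0; i < numBuffers; ++i) {
    ops.push_back("gpu.wait async");
    ops.push_back("gpu.alloc async");
    ops.push_back("gpu.memcpy async");
  }
  // Blocking wait on all copy-in tokens, then the asynchronous launch.
  ops.push_back("gpu.wait");
  ops.push_back("gpu.launch_func async");
  // Parameters out: the output is copied back depending on the kernel
  // token; every other buffer starts a fresh chain. All are deallocated.
  for (unsigned i = 0; i < numBuffers; ++i) {
    if (i == 0)
      ops.push_back("gpu.memcpy async"); // device-to-host, after kernel
    else
      ops.push_back("gpu.wait async");
    ops.push_back("gpu.dealloc async");
  }
  // Final blocking wait on all dealloc tokens.
  ops.push_back("gpu.wait");
  return ops;
}
```

Calling `emitHostOps(5)` yields 28 entries, matching the five-buffer sequence that the updated `gpu_matmul.mlir` and `gpu_matvec.mlir` tests check for.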

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

aartbik created this revision. Apr 18 2023, 8:57 PM

Herald added a project: Restricted Project. Apr 18 2023, 8:57 PM

Herald added subscribers: bviyer, hanchung, jsetoain and 25 others.

aartbik requested review of this revision. Apr 18 2023, 8:57 PM

Herald added a reviewer: nicolasvasilache. Apr 18 2023, 8:57 PM

Herald added a reviewer: herhut.

Herald added a project: Restricted Project.

Herald added subscribers: stephenneuendorffer, nicolasvasilache.

aartbik added reviewers: bixia, Peiming, wrengr, ThomasRaoux, christopherbate, ezhulenev, george.karpenkov, K-Wu. Apr 18 2023, 9:00 PM

Harbormaster completed remote builds in B226510: Diff 514818. Apr 18 2023, 9:06 PM

Peiming added inline comments. Apr 20 2023, 10:20 AM

mlir/lib/Dialect/SparseTensor/Transforms/SparseGPUCodegen.cpp
132–133	This reads weird... maybe remove the first "that"?
183	When will this condition be true? It seems it is `false` all the time (at least for now)?

addressed comments, added TODO

aartbik added inline comments. Apr 20 2023, 1:50 PM

mlir/lib/Dialect/SparseTensor/Transforms/SparseGPUCodegen.cpp
132–133	fixed the comment
183	I agree this was a bit hidden "commented out" code. I added a TODO to investigate this and/or couple this with a compiler option (I really want to keep the code around for fast experimentation, also with our intern starting soon ;-)
332	note that the TODO is here

Peiming accepted this revision. Apr 20 2023, 2:14 PM

This revision is now accepted and ready to land. Apr 20 2023, 2:14 PM

Harbormaster completed remote builds in B226971: Diff 515468. Apr 20 2023, 2:28 PM

Closed by commit rG86888e420c41: [mlir][sparse][gpu] generate proper memcpy in/out host and device (authored by aartbik). Apr 21 2023, 9:31 AM

This revision was automatically updated to reflect the committed changes.

aartbik added a commit: rG86888e420c41: [mlir][sparse][gpu] generate proper memcpy in/out host and device.

Revision Contents

Path                                                                           Size
mlir/lib/Dialect/SparseTensor/Transforms/SparseGPUCodegen.cpp                  167 lines
mlir/test/Dialect/SparseTensor/GPU/gpu_combi.mlir                              42 lines
mlir/test/Dialect/SparseTensor/GPU/gpu_matmul.mlir                             34 lines
mlir/test/Dialect/SparseTensor/GPU/gpu_matvec.mlir                             34 lines
mlir/test/Integration/Dialect/SparseTensor/GPU/CUDA/sparse-matvec-const.mlir   65 lines
mlir/test/Integration/Dialect/SparseTensor/GPU/CUDA/sparse-mma-2-4-f16.mlir    8 lines

Diff 515789

mlir/lib/Dialect/SparseTensor/Transforms/SparseGPUCodegen.cpp

[... 70 lines not shown ...]
static gpu::GPUFuncOp genGPUFunc(OpBuilder &builder, gpu::GPUModuleOp gpuModule,
auto gpuFunc =		auto gpuFunc =
builder.create<gpu::GPUFuncOp>(gpuModule->getLoc(), kernelName, type);		builder.create<gpu::GPUFuncOp>(gpuModule->getLoc(), kernelName, type);
gpuFunc->setAttr(gpu::GPUDialect::getKernelFuncAttrName(),		gpuFunc->setAttr(gpu::GPUDialect::getKernelFuncAttrName(),
builder.getUnitAttr());		builder.getUnitAttr());
return gpuFunc;		return gpuFunc;
}		}

/// Constructs code to launch GPU kernel.		/// Constructs code to launch GPU kernel.
static void genLaunchGPUFunc(OpBuilder &builder, gpu::GPUFuncOp gpuFunc,		static Value genLaunchGPUFunc(OpBuilder &builder, gpu::GPUFuncOp gpuFunc,
SmallVectorImpl<Value> &args,		SmallVectorImpl<Value> &args,
		SmallVectorImpl<Value> &tokens,
unsigned numThreads) {		unsigned numThreads) {
Location loc = gpuFunc->getLoc();		Location loc = gpuFunc->getLoc();
Value none = TypedValue<::mlir::IntegerType>{};		Value none = TypedValue<::mlir::IntegerType>{};
Value one = constantIndex(builder, loc, 1);		Value one = constantIndex(builder, loc, 1);
Value numT = constantIndex(builder, loc, numThreads);		Value numT = constantIndex(builder, loc, numThreads);
gpu::KernelDim3 gridSize = {one, one, one};		gpu::KernelDim3 gridSize = {one, one, one};
gpu::KernelDim3 blckSize = {numT, one, one};		gpu::KernelDim3 blckSize = {numT, one, one};
builder.create<gpu::LaunchFuncOp>(loc, gpuFunc, gridSize, blckSize,		return builder
/*dynSharedMemSz*/ none, args);		.create<gpu::LaunchFuncOp>(loc, gpuFunc, gridSize, blckSize,
		/*dynSharedMemSz*/ none, args,
		builder.getType<gpu::AsyncTokenType>(), tokens)
		.getAsyncToken();
}		}

/// Maps the provided ranked host buffer into the device address space.		/// Maps the provided ranked host buffer into the device address space.
/// Writes from the host are guaranteed to be visible to device kernels		/// Writes from the host are guaranteed to be visible to device kernels
/// that are launched afterwards. Writes from the device are guaranteed		/// that are launched afterwards. Writes from the device are guaranteed
/// to be visible on the host after synchronizing with the device kernel		/// to be visible on the host after synchronizing with the device kernel
/// completion.		/// completion. Needs to cast the buffer to a unranked buffer.
static Value genHostRegisterMemref(OpBuilder &builder, Location loc,		static Value genHostRegisterMemref(OpBuilder &builder, Location loc,
Value mem) {		Value mem) {
MemRefType memTp = mem.getType().cast<MemRefType>();		MemRefType memTp = mem.getType().cast<MemRefType>();
UnrankedMemRefType resTp =		UnrankedMemRefType resTp =
UnrankedMemRefType::get(memTp.getElementType(), /*memorySpace=*/0);		UnrankedMemRefType::get(memTp.getElementType(), /*memorySpace=*/0);
Value cast = builder.create<memref::CastOp>(loc, resTp, mem);		Value cast = builder.create<memref::CastOp>(loc, resTp, mem);
builder.create<gpu::HostRegisterOp>(loc, cast);		builder.create<gpu::HostRegisterOp>(loc, cast);
return mem; // convenience pass-through		return cast;
		}

		/// Unmaps the provided buffer, expecting the casted buffer.
		static void genHostUnregisterMemref(OpBuilder &builder, Location loc,
		Value cast) {
		builder.create<gpu::HostUnregisterOp>(loc, cast);
		}

		/// Generates first wait in an asynchronous chain.
		static Value genFirstWait(OpBuilder &builder, Location loc) {
		Type tokenType = builder.getType<gpu::AsyncTokenType>();
		return builder.create<gpu::WaitOp>(loc, tokenType, ValueRange())
		.getAsyncToken();
		}

		/// Generates last, blocking wait in an asynchronous chain.
		static void genBlockingWait(OpBuilder &builder, Location loc,
		ValueRange operands) {
		builder.create<gpu::WaitOp>(loc, Type(), operands);
		}

		/// Allocates memory on the device.
		/// TODO: A `host_shared` attribute could be used to indicate that
		/// the buffer is visible by both host and device, but lowering
		/// that feature does not seem to be fully supported yet.
		static gpu::AllocOp genAllocMemRef(OpBuilder &builder, Location loc, Value mem,
		Value token) {
		auto tp = mem.getType().cast<ShapedType>();
		auto elemTp = tp.getElementType();
		auto shape = tp.getShape();
		auto memTp = MemRefType::get(shape, elemTp);
		SmallVector<Value> dynamicSizes;
		for (unsigned r = 0, rank = tp.getRank(); r < rank; r++) {
		if (shape[r] == ShapedType::kDynamic) {
		Value dim = constantIndex(builder, loc, r);
		Value dimOp = builder.create<memref::DimOp>(loc, mem, dim);
		dynamicSizes.push_back(dimOp);
		}
		}
		return builder.create<gpu::AllocOp>(loc, TypeRange({memTp, token.getType()}),
		token, dynamicSizes, ValueRange());
		}

		/// Deallocates memory from the device.
		static Value genDeallocMemRef(OpBuilder &builder, Location loc, Value mem,
		Value token) {
		return builder.create<gpu::DeallocOp>(loc, token.getType(), token, mem)
		.getAsyncToken();
		}

		/// Copies memory between host and device (direction is implicit).
		static Value genCopyMemRef(OpBuilder &builder, Location loc, Value dst,
		Value src, Value token) {
		return builder.create<gpu::MemcpyOp>(loc, token.getType(), token, dst, src)
		.getAsyncToken();
		}

		/// Prepares the outlined arguments, passing scalars and buffers in. Here we
		/// assume that the first buffer is the one allocated for output. We create
		/// a set of properly chained asynchronous allocation/copy pairs to increase
		/// overlap before launching the kernel.
		/// TODO: the output assumption may be a bit too brittle
		static Value genParametersIn(OpBuilder &builder, Location loc,
		SmallVectorImpl<Value> &scalars,
		SmallVectorImpl<Value> &buffers,
		SmallVectorImpl<Value> &args,
		SmallVectorImpl<Value> &tokens,
		bool useHostRegistrationForOut) {
		Value out;
		// Scalars are passed by value.
		for (Value s : scalars)
		args.push_back(s);
		// Buffers are need to be made visible on device.
		for (Value b : buffers) {
		if (useHostRegistrationForOut) {
		out = genHostRegisterMemref(builder, loc, b);
		args.push_back(b);
		useHostRegistrationForOut = false;
		continue;
		}
		Value firstToken = genFirstWait(builder, loc);
		auto alloc = genAllocMemRef(builder, loc, b, firstToken);
		Value devMem = alloc.getResult(0);
		Value depToken = alloc.getAsyncToken(); // copy-after-alloc
		args.push_back(devMem);
		tokens.push_back(genCopyMemRef(builder, loc, devMem, b, depToken));
		}
		return out;
		}

		/// Finalizes the outlined arguments. The output buffer is copied depending
		/// on the kernel token and then deallocated. All other buffers are simply
		/// deallocated. Then we wait for all operations to complete.
		static void genParametersOut(OpBuilder &builder, Location loc, Value out,
		Value kernelToken, SmallVectorImpl<Value> &scalars,
		SmallVectorImpl<Value> &buffers,
		SmallVectorImpl<Value> &args,
		SmallVectorImpl<Value> &tokens) {
		unsigned base = scalars.size();
		for (unsigned i = base, e = args.size(); i < e; i++) {
		Value firstToken;
		if (i == base) {
		// Assumed output parameter: unregister or copy-out.
		if (out) {
		genHostUnregisterMemref(builder, loc, out);
		out = Value();
		continue;
		}
		firstToken =
		genCopyMemRef(builder, loc, buffers[0], args[i], kernelToken);
		} else {
		firstToken = genFirstWait(builder, loc);
		}
		tokens.push_back(genDeallocMemRef(builder, loc, args[i], firstToken));
		}
}		}

/// Constructs code for new GPU kernel.		/// Constructs code for new GPU kernel.
static void genGPUCode(PatternRewriter &rewriter, gpu::GPUFuncOp gpuFunc,		static void genGPUCode(PatternRewriter &rewriter, gpu::GPUFuncOp gpuFunc,
scf::ParallelOp forallOp,		scf::ParallelOp forallOp,
SmallVectorImpl<Value> &constants,		SmallVectorImpl<Value> &constants,
SmallVectorImpl<Value> &scalars,		SmallVectorImpl<Value> &scalars,
SmallVectorImpl<Value> &buffers) {		SmallVectorImpl<Value> &buffers) {
[... 40 lines not shown ...]
}		}

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// Rewriting rules.		// Rewriting rules.
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

/// Proof-of-concept rewriter. This rule generates a CUDA implementation		/// Proof-of-concept rewriter. This rule generates a CUDA implementation
/// for each outermost forall loop generated by the sparse compiler.		/// for each outermost forall loop generated by the sparse compiler.
//		/// TODO: right works with parallelization-strategy=dense-outer-loop
// TODO: right works with parallelization-strategy=dense-outer-loop		/// but give this its own flags in the future
// but give this its own flags in the future
//
struct ForallRewriter : public OpRewritePattern<scf::ParallelOp> {		struct ForallRewriter : public OpRewritePattern<scf::ParallelOp> {
using OpRewritePattern<scf::ParallelOp>::OpRewritePattern;		using OpRewritePattern<scf::ParallelOp>::OpRewritePattern;

ForallRewriter(MLIRContext *context, unsigned nT)		ForallRewriter(MLIRContext *context, unsigned nT)
: OpRewritePattern(context), numThreads(nT){};		: OpRewritePattern(context), numThreads(nT){};

LogicalResult matchAndRewrite(scf::ParallelOp forallOp,		LogicalResult matchAndRewrite(scf::ParallelOp forallOp,
PatternRewriter &rewriter) const override {		PatternRewriter &rewriter) const override {
[... 33 lines not shown ...]
for (Value val : invariants) {
constants.push_back(val);		constants.push_back(val);
else if (tp.isa<FloatType>() || tp.isIntOrIndex())		else if (tp.isa<FloatType>() || tp.isIntOrIndex())
scalars.push_back(val);		scalars.push_back(val);
else if (isa<MemRefType>(tp))		else if (isa<MemRefType>(tp))
buffers.push_back(val);		buffers.push_back(val);
else		else
return failure(); // don't know how to share		return failure(); // don't know how to share
}		}
// Prepare the outlined arguments, register buffers.		// Pass outlined non-constant values.
		// TODO: Experiment with `useHostRegistrationForOut` to see if we want to
		// keep the feature at all (either through a heuristic or compiler
		// option for gpu codegen).
Location loc = forallOp->getLoc();		Location loc = forallOp->getLoc();
SmallVector<Value> args;		SmallVector<Value> args;
for (Value s : scalars)		SmallVector<Value> tokens;
args.push_back(s);		Value out = genParametersIn(rewriter, loc, scalars, buffers, args, tokens,
for (Value b : buffers)		/*useHostRegistrationForOut=*/false);
args.push_back(genHostRegisterMemref(rewriter, loc, b));
auto saveIp = rewriter.saveInsertionPoint();
// Set up GPU module and construct GPU function.		// Set up GPU module and construct GPU function.
		auto saveIp = rewriter.saveInsertionPoint();
ModuleOp topModule = forallOp->getParentOfType<ModuleOp>();		ModuleOp topModule = forallOp->getParentOfType<ModuleOp>();
auto gpuModule = genGPUModule(rewriter, topModule);		auto gpuModule = genGPUModule(rewriter, topModule);
auto gpuFunc = genGPUFunc(rewriter, gpuModule, args);		auto gpuFunc = genGPUFunc(rewriter, gpuModule, args);
genGPUCode(rewriter, gpuFunc, forallOp, constants, scalars, buffers);		genGPUCode(rewriter, gpuFunc, forallOp, constants, scalars, buffers);
// Generate code that launches the kernel.		// Generate code that launches the kernel asynchronously, blocking on all
		// opens tokens and yielding a new token for the output.
		// TODO: Passing in tokens to launch up does not seem to be properly lowered
		// by cubin yet, hence the current blocking wait.
rewriter.restoreInsertionPoint(saveIp);		rewriter.restoreInsertionPoint(saveIp);
genLaunchGPUFunc(rewriter, gpuFunc, args, numThreads);		genBlockingWait(rewriter, loc, tokens);
		tokens.clear();
		Value kernelToken =
		genLaunchGPUFunc(rewriter, gpuFunc, args, tokens, numThreads);
		// Finalize the outlined arguments.
		genParametersOut(rewriter, loc, out, kernelToken, scalars, buffers, args,
		tokens);
		genBlockingWait(rewriter, loc, tokens);
rewriter.eraseOp(forallOp);		rewriter.eraseOp(forallOp);
return success();		return success();
}		}

private:		private:
// Helper method to see if block appears in given loop.		// Helper method to see if block appears in given loop.
static bool isNestedIn(Block *block, scf::ParallelOp forallOp) {		static bool isNestedIn(Block *block, scf::ParallelOp forallOp) {
for (Operation *o = block->getParentOp(); o; o = o->getParentOp()) {		for (Operation *o = block->getParentOp(); o; o = o->getParentOp()) {
[... 19 lines not shown ...]

mlir/test/Dialect/SparseTensor/GPU/gpu_combi.mlir

	// RUN: mlir-opt %s --linalg-generalize-named-ops \			// RUN: mlir-opt %s --linalg-generalize-named-ops \
	// RUN: --pre-sparsification-rewrite \			// RUN: --pre-sparsification-rewrite \
	// RUN: --sparsification="parallelization-strategy=dense-outer-loop" \			// RUN: --sparsification="parallelization-strategy=dense-outer-loop" \
	// RUN: --sparse-gpu-codegen | FileCheck %s			// RUN: --sparse-gpu-codegen | FileCheck %s

	#CSR = #sparse_tensor.encoding<{ dimLevelType = [ "dense", "compressed" ] }>			#CSR = #sparse_tensor.encoding<{ dimLevelType = [ "dense", "compressed" ] }>

	//			//
	// CHECK-LABEL: gpu.module @sparse_kernels			// CHECK-LABEL: gpu.module @sparse_kernels
	// CHECK-DAG: gpu.func @kernel0			// CHECK: gpu.func @kernel1
	// CHECK-DAG: gpu.func @kernel1			// CHECK: gpu.func @kernel0
	//			//
	// CHECK-LABEL: func.func @matmuls			// CHECK-LABEL: func.func @matmuls
	// CHECK-DAG: gpu.launch_func @sparse_kernels::@kernel0 blocks			// CHECK: gpu.alloc async
	// CHECK-DAG: gpu.launch_func @sparse_kernels::@kernel1 blocks			// CHECK: gpu.memcpy async
				// CHECK: gpu.alloc async
				// CHECK: gpu.memcpy async
				// CHECK: gpu.alloc async
				// CHECK: gpu.memcpy async
				// CHECK: gpu.alloc async
				// CHECK: gpu.memcpy async
				// CHECK: gpu.alloc async
				// CHECK: gpu.memcpy async
				// CHECK: %[[T1:.*]] = gpu.launch_func async @sparse_kernels::@kernel1 blocks
				// CHECK: gpu.memcpy async [%[[T1]]]
				// CHECK: gpu.dealloc async
				// CHECK: gpu.dealloc async
				// CHECK: gpu.dealloc async
				// CHECK: gpu.dealloc async
				// CHECK: gpu.dealloc async
				// CHECK: gpu.wait
				// CHECK: gpu.alloc async
				// CHECK: gpu.memcpy async
				// CHECK: gpu.alloc async
				// CHECK: gpu.memcpy async
				// CHECK: gpu.alloc async
				// CHECK: gpu.memcpy async
				// CHECK: gpu.alloc async
				// CHECK: gpu.memcpy async
				// CHECK: gpu.alloc async
				// CHECK: gpu.memcpy async
				// CHECK: %[[T0:.*]] = gpu.launch_func async @sparse_kernels::@kernel0 blocks
				// CHECK: gpu.memcpy async [%[[T0]]]
				// CHECK: gpu.dealloc async
				// CHECK: gpu.dealloc async
				// CHECK: gpu.dealloc async
				// CHECK: gpu.dealloc async
				// CHECK: gpu.dealloc async
				// CHECK: gpu.wait
	//			//
	func.func @matmuls(%A: tensor<1024x8xf64>,			func.func @matmuls(%A: tensor<1024x8xf64>,
	%B: tensor<8x1024xf64, #CSR>,			%B: tensor<8x1024xf64, #CSR>,
	%C: tensor<1024x1024xf64, #CSR>) -> tensor<1024x1024xf64> {			%C: tensor<1024x1024xf64, #CSR>) -> tensor<1024x1024xf64> {
	%Z = arith.constant dense<0.0> : tensor<1024x1024xf64>			%Z = arith.constant dense<0.0> : tensor<1024x1024xf64>
	%T = linalg.matmul			%T = linalg.matmul
	ins(%A, %B: tensor<1024x8xf64>, tensor<8x1024xf64, #CSR>)			ins(%A, %B: tensor<1024x8xf64>, tensor<8x1024xf64, #CSR>)
	outs(%Z: tensor<1024x1024xf64>) -> tensor<1024x1024xf64>			outs(%Z: tensor<1024x1024xf64>) -> tensor<1024x1024xf64>
	%D = linalg.matmul			%D = linalg.matmul
	ins(%T, %C: tensor<1024x1024xf64>, tensor<1024x1024xf64, #CSR>)			ins(%T, %C: tensor<1024x1024xf64>, tensor<1024x1024xf64, #CSR>)
	outs(%Z: tensor<1024x1024xf64>) -> tensor<1024x1024xf64>			outs(%Z: tensor<1024x1024xf64>) -> tensor<1024x1024xf64>
	return %D : tensor<1024x1024xf64>			return %D : tensor<1024x1024xf64>
	}			}

mlir/test/Dialect/SparseTensor/GPU/gpu_matmul.mlir

	[... 41 lines not shown ...]
	// CHECK: } {"Emitted from" = "linalg.generic"}			// CHECK: } {"Emitted from" = "linalg.generic"}
	// CHECK: } {"Emitted from" = "linalg.generic"}			// CHECK: } {"Emitted from" = "linalg.generic"}
	// CHECK: }			// CHECK: }
	// CHECK: gpu.return			// CHECK: gpu.return
	// CHECK: }			// CHECK: }
	//			//
	//			//
	// CHECK-LABEL: func.func @matmul			// CHECK-LABEL: func.func @matmul
	// CHECK: gpu.host_register			// CHECK: gpu.wait async
	// CHECK: gpu.host_register			// CHECK: gpu.alloc async
	// CHECK: gpu.host_register			// CHECK: %[[S0:.*]] = gpu.memcpy async
	// CHECK: gpu.host_register			// CHECK: gpu.wait async
	// CHECK: gpu.host_register			// CHECK: gpu.alloc async
	// CHECK: gpu.launch_func @sparse_kernels::@kernel0 blocks			// CHECK: %[[S1:.*]] = gpu.memcpy async
				// CHECK: gpu.wait async
				// CHECK: gpu.alloc async
				// CHECK: %[[S2:.*]] = gpu.memcpy async
				// CHECK: gpu.wait async
				// CHECK: gpu.alloc async
				// CHECK: %[[S3:.*]] = gpu.memcpy async
				// CHECK: gpu.wait async
				// CHECK: gpu.alloc async
				// CHECK: %[[S4:.*]] = gpu.memcpy async
				// CHECK: gpu.wait [%[[S0]], %[[S1]], %[[S2]], %[[S3]], %[[S4]]
				// CHECK: %[[T0:.*]] = gpu.launch_func async @sparse_kernels::@kernel0 blocks
				// CHECK: %[[M0:.*]] = gpu.memcpy async [%[[T0]]]
				// CHECK: %[[M1:.*]] = gpu.dealloc async [%[[M0]]]
				// CHECK: %[[M2:.*]] = gpu.wait async
				// CHECK: %[[M3:.*]] = gpu.dealloc async [%[[M2]]]
				// CHECK: %[[M4:.*]] = gpu.wait async
				// CHECK: %[[M5:.*]] = gpu.dealloc async [%[[M4]]]
				// CHECK: %[[M6:.*]] = gpu.wait async
				// CHECK: %[[M7:.*]] = gpu.dealloc async [%[[M6]]]
				// CHECK: %[[M8:.*]] = gpu.wait async
				// CHECK: %[[M9:.*]] = gpu.dealloc async [%[[M8]]]
				// CHECK: gpu.wait [%[[M1]], %[[M3]], %[[M5]], %[[M7]], %[[M9]]
	//			//
	func.func @matmul(%A: tensor<?x?xf64, #CSR>, %B: tensor<?x?xf64>, %C_in: tensor<?x?xf64>) -> tensor<?x?xf64> {			func.func @matmul(%A: tensor<?x?xf64, #CSR>, %B: tensor<?x?xf64>, %C_in: tensor<?x?xf64>) -> tensor<?x?xf64> {
	%C_out = linalg.matmul			%C_out = linalg.matmul
	ins(%A, %B: tensor<?x?xf64, #CSR>, tensor<?x?xf64>)			ins(%A, %B: tensor<?x?xf64, #CSR>, tensor<?x?xf64>)
	outs(%C_in: tensor<?x?xf64>) -> tensor<?x?xf64>			outs(%C_in: tensor<?x?xf64>) -> tensor<?x?xf64>
	return %C_out : tensor<?x?xf64>			return %C_out : tensor<?x?xf64>
	}			}

mlir/test/Dialect/SparseTensor/GPU/gpu_matvec.mlir

	[... 37 lines not shown ...]
	// CHECK: scf.yield %[[VAL_26]] : f64			// CHECK: scf.yield %[[VAL_26]] : f64
	// CHECK: } {"Emitted from" = "linalg.generic"}			// CHECK: } {"Emitted from" = "linalg.generic"}
	// CHECK: memref.store %[[VAL_27:.*]], %[[VAL_1]]{{\[}}%[[VAL_14]]] : memref<?xf64>			// CHECK: memref.store %[[VAL_27:.*]], %[[VAL_1]]{{\[}}%[[VAL_14]]] : memref<?xf64>
	// CHECK: }			// CHECK: }
	// CHECK: gpu.return			// CHECK: gpu.return
	// CHECK: }			// CHECK: }
	//			//
	// CHECK-LABEL: func.func @matvec			// CHECK-LABEL: func.func @matvec
	// CHECK: gpu.host_register			// CHECK: gpu.wait async
	// CHECK: gpu.host_register			// CHECK: gpu.alloc async
	// CHECK: gpu.host_register			// CHECK: %[[S0:.*]] = gpu.memcpy async
	// CHECK: gpu.host_register			// CHECK: gpu.wait async
	// CHECK: gpu.host_register			// CHECK: gpu.alloc async
	// CHECK: gpu.launch_func @sparse_kernels::@kernel0 blocks			// CHECK: %[[S1:.*]] = gpu.memcpy async
				// CHECK: gpu.wait async
				// CHECK: gpu.alloc async
				// CHECK: %[[S2:.*]] = gpu.memcpy async
				// CHECK: gpu.wait async
				// CHECK: gpu.alloc async
				// CHECK: %[[S3:.*]] = gpu.memcpy async
				// CHECK: gpu.wait async
				// CHECK: gpu.alloc async
				// CHECK: %[[S4:.*]] = gpu.memcpy async
				// CHECK: gpu.wait [%[[S0]], %[[S1]], %[[S2]], %[[S3]], %[[S4]]
				// CHECK: %[[T0:.*]] = gpu.launch_func async @sparse_kernels::@kernel0 blocks
				// CHECK: %[[M0:.*]] = gpu.memcpy async [%[[T0]]]
				// CHECK: %[[M1:.*]] = gpu.dealloc async [%[[M0]]]
				// CHECK: %[[M2:.*]] = gpu.wait async
				// CHECK: %[[M3:.*]] = gpu.dealloc async [%[[M2]]]
				// CHECK: %[[M4:.*]] = gpu.wait async
				// CHECK: %[[M5:.*]] = gpu.dealloc async [%[[M4]]]
				// CHECK: %[[M6:.*]] = gpu.wait async
				// CHECK: %[[M7:.*]] = gpu.dealloc async [%[[M6]]]
				// CHECK: %[[M8:.*]] = gpu.wait async
				// CHECK: %[[M9:.*]] = gpu.dealloc async [%[[M8]]]
				// CHECK: gpu.wait [%[[M1]], %[[M3]], %[[M5]], %[[M7]], %[[M9]]
	//			//
	func.func @matvec(%A: tensor<?x?xf64, #CSR>, %x: tensor<?xf64>, %y_in: tensor<?xf64>) -> tensor<?xf64> {			func.func @matvec(%A: tensor<?x?xf64, #CSR>, %x: tensor<?xf64>, %y_in: tensor<?xf64>) -> tensor<?xf64> {
	%y_out = linalg.matvec			%y_out = linalg.matvec
	ins(%A, %x: tensor<?x?xf64, #CSR>, tensor<?xf64>)			ins(%A, %x: tensor<?x?xf64, #CSR>, tensor<?xf64>)
	outs(%y_in: tensor<?xf64>) -> tensor<?xf64>			outs(%y_in: tensor<?xf64>) -> tensor<?xf64>
	return %y_out : tensor<?xf64>			return %y_out : tensor<?xf64>
	}			}

mlir/test/Integration/Dialect/SparseTensor/GPU/CUDA/sparse-matvec-const.mlir

This file was added.

				//
				// NOTE: this test requires gpu-sm80
				//
				// RUN: mlir-opt %s \
				// RUN: --sparse-compiler="enable-runtime-library=false parallelization-strategy=dense-outer-loop gpu-triple=nvptx64-nvidia-cuda gpu-chip=sm_80 gpu-features=+ptx71" \
				// RUN: | mlir-cpu-runner \
				// RUN: --shared-libs=%mlir_cuda_runtime \
				// RUN: --shared-libs=%mlir_runner_utils \
				// RUN: --e main --entry-point-result=void \
				// RUN: | FileCheck %s

				#CSR = #sparse_tensor.encoding<{ dimLevelType = [ "dense", "compressed" ] }>

				module {
				// Compute matrix vector y = Ax
				func.func @matvec(%A: tensor<1024x64xf64, #CSR>, %x: tensor<64xf64>, %y_in: tensor<1024xf64>) -> tensor<1024xf64> {
				%y_out = linalg.matvec
				ins(%A, %x: tensor<1024x64xf64, #CSR>, tensor<64xf64>)
				outs(%y_in: tensor<1024xf64>) -> tensor<1024xf64>
				return %y_out : tensor<1024xf64>
				}

				memref.global "private" constant @__constant_64xf64 : memref<64xf64> = dense<1.000000e+00> {alignment = 64 : i64}

				func.func @main() {
				%f0 = arith.constant 0.0 : f64
				%c0 = arith.constant 0 : index
				%c1 = arith.constant 1 : index

				// Stress test with a dense matrix DA.
				%DA = tensor.generate {
				^bb0(%i: index, %j: index):
				%k = arith.addi %i, %j : index
				%l = arith.index_cast %k : index to i64
				%f = arith.uitofp %l : i64 to f64
				tensor.yield %f : f64
				} : tensor<1024x64xf64>

				// Convert to a "sparse" 1024 x 64 matrix A.
				%A = sparse_tensor.convert %DA : tensor<1024x64xf64> to tensor<1024x64xf64, #CSR>

				// Initialize dense vector to 1024 zeros.
				%y = tensor.generate {
				^bb0(%i : index):
				tensor.yield %f0 : f64
				} : tensor<1024xf64>

				// Call the kernel with an vector taken from global memory.
				%xbuf = memref.get_global @__constant_64xf64 : memref<64xf64>
				%x = bufferization.to_tensor %xbuf restrict : memref<64xf64>
				%0 = call @matvec(%A, %x, %y) : (tensor<1024x64xf64, #CSR>, tensor<64xf64>, tensor<1024xf64>) -> tensor<1024xf64>

				//
				// Sanity check on results.
				//
				// CHECK: ( 2016, 2080, 2144, 2208, 2272, 2336, 2400, 2464, 2528, 2592, 2656, 2720, 2784, 2848, 2912, 2976, 3040, 3104, 3168, 3232, 3296, 3360, 3424, 3488, 3552, 3616, 3680, 3744, 3808, 3872, 3936, 4000, 4064, 4128, 4192, 4256, 4320, 4384, 4448, 4512, 4576, 4640, 4704, 4768, 4832, 4896, 4960, 5024, 5088, 5152, 5216, 5280, 5344, 5408, 5472, 5536, 5600, 5664, 5728, 5792, 5856, 5920, 5984, 6048 )
				//
				%pb0 = vector.transfer_read %0[%c0], %f0 : tensor<1024xf64>, vector<64xf64>
				vector.print %pb0 : vector<64xf64>

				// Release the resources.
				bufferization.dealloc_tensor %A : tensor<1024x64xf64, #CSR>
				return
				}
				}
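
The expected values in the CHECK line of the test above can be reproduced on the host. The sketch below is a plain C++ reference computation, not part of the revision; `referenceMatvec` is a hypothetical name. It assumes exactly the inputs the test builds: A[i][j] = i + j for a 1024 x 64 matrix, x filled with 1.0 from the global constant, and y initialized to zero, so that y[i] = 64*i + 2016.

```cpp
#include <vector>

// Host-only reference for y = A x with the test's inputs (assumed here):
// A[i][j] = i + j, x[j] = 1.0, and y starting at zero.
std::vector<double> referenceMatvec(int rows = 1024, int cols = 64) {
  std::vector<double> y(rows, 0.0);
  for (int i = 0; i < rows; ++i) {
    for (int j = 0; j < cols; ++j) {
      const double a = static_cast<double>(i + j); // entry of dense matrix DA
      const double x = 1.0;                        // entry of the constant vector
      y[i] += a * x;
    }
  }
  return y;
}
```

The first 64 entries of the result run from 2016 to 6048 in steps of 64, which is the vector the test reads back with `vector.transfer_read` and prints.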

mlir/test/Integration/Dialect/SparseTensor/GPU/CUDA/sparse-mma-2-4-f16.mlir

[... 336 lines not shown ...]
func.func @main() {
// CHECK-NEXT: ( 7, 6, 5, 4, 3, 2, 1, 0, -1, -2, -3, -4, -5, -6, -7, -8, -9, -10, -11, -12, -13, -14, -15, -16, -17, -18, -19, -20, -21, -22, -23, -24 )		// CHECK-NEXT: ( 7, 6, 5, 4, 3, 2, 1, 0, -1, -2, -3, -4, -5, -6, -7, -8, -9, -10, -11, -12, -13, -14, -15, -16, -17, -18, -19, -20, -21, -22, -23, -24 )
//		//
//		//
scf.for %pbi = %c0 to %c8 step %c1 {		scf.for %pbi = %c0 to %c8 step %c1 {
%pb0 = vector.transfer_read %b[%pbi, %c0], %f0 : memref<8x32xf16>, vector<32xf16>		%pb0 = vector.transfer_read %b[%pbi, %c0], %f0 : memref<8x32xf16>, vector<32xf16>
vector.print %pb0 : vector<32xf16>		vector.print %pb0 : vector<32xf16>
}		}

// Maps the provided host buffer into the device address space.		// Maps the provided host buffers into the device address space.
// Writes from the host are guaranteed to be visible to device		// Writes from the host are guaranteed to be visible to device
// kernels that are launched afterwards. Writes from the device		// kernels that are launched afterwards. Writes from the device
// are guaranteed to be visible on the host after synchronizing		// are guaranteed to be visible on the host after synchronizing
// with the device kernel completion.		// with the device kernel completion.
%cast_a = memref.cast %a : memref<16x16xf16> to memref<*xf16>		%cast_a = memref.cast %a : memref<16x16xf16> to memref<*xf16>
gpu.host_register %cast_a : memref<*xf16>		gpu.host_register %cast_a : memref<*xf16>
%cast_m = memref.cast %m : memref<16x2xi16> to memref<*xi16>		%cast_m = memref.cast %m : memref<16x2xi16> to memref<*xi16>
gpu.host_register %cast_m : memref<*xi16>		gpu.host_register %cast_m : memref<*xi16>
[... 9 lines not shown ...]
gpu.launch_func
@kernels::@mma_sp_sync_f16_16832		@kernels::@mma_sp_sync_f16_16832
blocks in (%t1, %t1, %t1) // gridSizeX,Y,Z		blocks in (%t1, %t1, %t1) // gridSizeX,Y,Z
threads in (%t32, %t1, %t1) // blockSizeX,Y,Z		threads in (%t32, %t1, %t1) // blockSizeX,Y,Z
args(%a : memref<16x16xf16>,		args(%a : memref<16x16xf16>,
%m : memref<16x2xi16>,		%m : memref<16x2xi16>,
%b : memref<8x32xf16>,		%b : memref<8x32xf16>,
%c : memref<16x8xf16>)		%c : memref<16x8xf16>)

		// Unmaps the host buffers.
		gpu.host_unregister %cast_a : memref<*xf16>
		gpu.host_unregister %cast_m : memref<*xi16>
		gpu.host_unregister %cast_b : memref<*xf16>
		gpu.host_unregister %cast_c : memref<*xf16>

//		//
// Verify computed matrix C.		// Verify computed matrix C.
//		//
// CHECK-NEXT: ( -2720, -2584, -2448, -2312, -2176, -2040, -1904, -1768 )		// CHECK-NEXT: ( -2720, -2584, -2448, -2312, -2176, -2040, -1904, -1768 )
// CHECK-NEXT: ( -2960, -2808, -2656, -2504, -2352, -2200, -2048, -1896 )		// CHECK-NEXT: ( -2960, -2808, -2656, -2504, -2352, -2200, -2048, -1896 )
// CHECK-NEXT: ( -3200, -3032, -2864, -2696, -2528, -2360, -2192, -2024 )		// CHECK-NEXT: ( -3200, -3032, -2864, -2696, -2528, -2360, -2192, -2024 )
// CHECK-NEXT: ( -3440, -3256, -3072, -2888, -2704, -2520, -2336, -2152 )		// CHECK-NEXT: ( -3440, -3256, -3072, -2888, -2704, -2520, -2336, -2152 )
// CHECK-NEXT: ( -3680, -3480, -3280, -3080, -2880, -2680, -2480, -2280 )		// CHECK-NEXT: ( -3680, -3480, -3280, -3080, -2880, -2680, -2480, -2280 )
[... 20 lines not shown ...]
