This is an archive of the discontinued LLVM Phabricator instance.

[mlir][sparse][gpu] a first prototype sparse GPU code generator
Closed, Public

Authored by aartbik on Apr 3 2023, 4:21 PM.

Details

Reviewers

bixia
Peiming
ThomasRaoux
nicolasvasilache
herhut
christopherbate
guraypp
K-Wu
ezhulenev
wrengr
anlunx

Commits

rG19466ebc7ff8: [mlir][sparse][gpu] a first prototype sparse GPU code generator

Summary

This adds a proof-of-concept GPU code generator
to the sparse compiler pipeline, currently only capable
of generating CUDA threads for outermost parallel loops.

The objective, obviously, is to grow this concept
into a full-blown GPU code generator, capable of the
right combination of code generation as well as exploiting
idiomatic kernels or vendor-specific libraries (think cuSparse).

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

aartbik created this revision. Apr 3 2023, 4:21 PM

Herald added a reviewer: ThomasRaoux. Apr 3 2023, 4:21 PM

Herald added a project: Restricted Project.

Herald added subscribers: hanchung, jsetoain, Moerafaat and 23 others.

aartbik requested review of this revision. Apr 3 2023, 4:21 PM

Herald added a reviewer: nicolasvasilache. Apr 3 2023, 4:21 PM

Herald added a reviewer: herhut.

Herald added a project: Restricted Project.

Herald added subscribers: stephenneuendorffer, nicolasvasilache.

aartbik added reviewers: christopherbate, guraypp, K-Wu, ezhulenev. Apr 3 2023, 4:24 PM

aartbik added reviewers: wrengr, anlunx. Apr 3 2023, 4:38 PM

Harbormaster completed remote builds in B223460: Diff 510639. Apr 3 2023, 4:42 PM

Peiming added inline comments. Apr 3 2023, 5:13 PM

mlir/lib/Dialect/SparseTensor/Transforms/SparseGPUCodegen.cpp
168	We should be able to support reduction, right? By rewriting parallel reduce => gpu.*_reduce maybe? Maybe it is future work for you :-)
169–170	It seems that we should be able to support this too by tweaking the thread mapping? (Though the parallel for generated by the sparse compiler will always use 0 as the lower bound and 1 as the step.)
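One way to picture the reduction support raised in the comment on line 168: each thread accumulates a private partial result over its cyclically assigned iterations, and the partials are then combined pairwise in a tree, which is the general shape of a block-level reduction. The following is only a CPU simulation in plain C++ under that assumption; `cyclicReduce` and its parameters are hypothetical names and are not part of this revision or of the gpu dialect.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU simulation of a parallel sum reduction.
// Phase 1: thread t privately accumulates elements t, t + T, t + 2T, ...
// Phase 2: the partials are combined pairwise (tree), shrinking the
//          active range each round, as a block-level reduction would.
double cyclicReduce(const std::vector<double> &data, std::size_t numThreads) {
  assert(numThreads > 0 && "need at least one thread");
  std::vector<double> partial(numThreads, 0.0);
  for (std::size_t t = 0; t < numThreads; ++t)
    for (std::size_t i = t; i < data.size(); i += numThreads)
      partial[t] += data[i];
  // Tree combine; works for any thread count, not just powers of two.
  for (std::size_t active = numThreads; active > 1;) {
    const std::size_t half = (active + 1) / 2;
    for (std::size_t t = 0; t + half < active; ++t)
      partial[t] += partial[t + half];
    active = half;
  }
  return partial[0];
}
```

On a real device the second phase would need synchronization between rounds; the simulation runs the rounds sequentially, so that detail is not modeled here.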

aartbik marked an inline comment as done. Apr 3 2023, 5:18 PM

aartbik added inline comments.

mlir/lib/Dialect/SparseTensor/Transforms/SparseGPUCodegen.cpp
168	Agreed. CUDA has some very nifty reduction primitives we should use, but yeah, all future work. The first step, for now, is to get the pipeline working. This revision adds a CHECK test. After that, an end-to-end test. Once we have the building blocks more or less in place, the fun starts! (Also, perhaps we don't want a forall rewriter, but bake this directly into the loop emitter; I am not sure yet. The idioms, like 2:4, will need work along either path.)
169–170	Yeah, agreed. I kept the computation simple (for now), but we really should be able to support (i = lo; i < hi; i += step) eventually with some more work.
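The generalized thread mapping discussed for lines 169–170 could look roughly as follows: thread `tid` starts at `lo + tid * step` and advances by `numThreads * step`, which reduces to the mapping in this revision when `lo = 0` and `step = 1`. This is a sketch in plain C++ that simulates the schedule on the CPU; `cyclicSchedule` is a hypothetical helper, not code from the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: for each of numThreads threads, lists the
// iterations of `for (i = lo; i < hi; i += step)` that the thread
// would execute under cyclic scheduling:
//   first  = lo + tid * step
//   stride = numThreads * step
std::vector<std::vector<long>> cyclicSchedule(long lo, long hi, long step,
                                              std::size_t numThreads) {
  assert(step > 0 && numThreads > 0);
  std::vector<std::vector<long>> work(numThreads);
  const long stride = static_cast<long>(numThreads) * step;
  for (std::size_t tid = 0; tid < numThreads; ++tid)
    for (long i = lo + static_cast<long>(tid) * step; i < hi; i += stride)
      work[tid].push_back(i);
  return work;
}
```

Every iteration of the original loop is assigned to exactly one thread, and threads whose first index already exceeds `hi` simply do no work.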

aartbik added a child revision: D147571: [mlir][sparse][gpu] sparse GPU code generator pipeline setup. Apr 4 2023, 1:59 PM

Looks good to me. It looks like there is more work to get to something that can be performant (memory and block distribution are being inefficient). But that's a great start!

This revision is now accepted and ready to land. Apr 5 2023, 10:16 AM

In D147483#4246481, @ThomasRaoux wrote:

Looks good to me. It looks like there is more work to get to something that can be performant (memory and block distribution are being inefficient). But that's a great start!

Absolutely (in fact, some articles point out this is absolutely *not* the right way to make SpMV parallel :-).
But this prototype is incremental, i.e. first get the pipeline up and running, then get an idea on the basic building blocks required, and then go from there!
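To make the efficiency concern concrete, the sketch below simulates the row-per-thread CSR matrix-vector product on the CPU and records how many stored entries each thread processes. When row lengths differ, the per-thread load differs too, which is one commonly cited reason this mapping is not ideal for SpMV. The `Csr` struct and `spmvRowPerThread` are hypothetical names chosen for this illustration; they do not appear in the revision.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal CSR container (hypothetical names, chosen for this sketch).
struct Csr {
  std::vector<std::size_t> pos; // row i owns entries pos[i] .. pos[i+1]-1
  std::vector<std::size_t> crd; // column index of each stored entry
  std::vector<double> val;      // value of each stored entry
};

// Simulates y += A * x with rows dealt cyclically to numThreads threads.
// Returns the number of stored entries each thread had to process.
std::vector<std::size_t> spmvRowPerThread(const Csr &a,
                                          const std::vector<double> &x,
                                          std::vector<double> &y,
                                          std::size_t numThreads) {
  assert(numThreads > 0 && a.pos.size() == y.size() + 1);
  std::vector<std::size_t> load(numThreads, 0);
  for (std::size_t tid = 0; tid < numThreads; ++tid) {
    for (std::size_t row = tid; row < y.size(); row += numThreads) {
      double sum = y[row];
      for (std::size_t k = a.pos[row]; k < a.pos[row + 1]; ++k) {
        sum += a.val[k] * x[a.crd[k]];
        ++load[tid];
      }
      y[row] = sum;
    }
  }
  return load;
}
```

With one dense row and several nearly empty rows, the thread that owns the dense row does most of the work while the others sit idle.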

rebased with main

This revision was landed with ongoing or failed builds. Apr 5 2023, 11:32 AM

Closed by commit rG19466ebc7ff8: [mlir][sparse][gpu] a first prototype sparse GPU code generator (authored by aartbik).

This revision was automatically updated to reflect the committed changes.

aartbik added a commit: rG19466ebc7ff8: [mlir][sparse][gpu] a first prototype sparse GPU code generator.

Harbormaster completed remote builds in B223840: Diff 511152. Apr 5 2023, 11:39 AM

Revision Contents

Path	Size
mlir/include/mlir/Dialect/SparseTensor/Transforms/Passes.h	6 lines
mlir/include/mlir/Dialect/SparseTensor/Transforms/Passes.td	20 lines
mlir/lib/Dialect/SparseTensor/Transforms/CMakeLists.txt	1 line
mlir/lib/Dialect/SparseTensor/Transforms/SparseGPUCodegen.cpp	247 lines
mlir/lib/Dialect/SparseTensor/Transforms/SparseTensorPasses.cpp	25 lines
mlir/test/Dialect/SparseTensor/GPU/gpu_matmul.mlir	61 lines
mlir/test/Dialect/SparseTensor/GPU/gpu_matvec.mlir	58 lines
utils/bazel/llvm-project-overlay/mlir/BUILD.bazel	1 line

Diff 511170

mlir/include/mlir/Dialect/SparseTensor/Transforms/Passes.h

@@ ... @@ void populateSparseVectorizationPatterns(RewritePatternSet &patterns,
                                          bool enableVLAVectorization,
                                          bool enableSIMDIndex32);

 std::unique_ptr<Pass> createSparseVectorizationPass();
 std::unique_ptr<Pass> createSparseVectorizationPass(unsigned vectorLength,
                                                     bool enableVLAVectorization,
                                                     bool enableSIMDIndex32);

+void populateSparseGPUCodegenPatterns(RewritePatternSet &patterns,
+                                      unsigned numThreads);
+
+std::unique_ptr<Pass> createSparseGPUCodegenPass();
+std::unique_ptr<Pass> createSparseGPUCodegenPass(unsigned numThreads);
+
 //===----------------------------------------------------------------------===//
 // Registration.
 //===----------------------------------------------------------------------===//

 /// Generate the code for registering passes.
 #define GEN_PASS_REGISTRATION
 #include "mlir/Dialect/SparseTensor/Transforms/Passes.h.inc"

 } // namespace mlir

 #endif // MLIR_DIALECT_SPARSETENSOR_TRANSFORMS_PASSES_H_

mlir/include/mlir/Dialect/SparseTensor/Transforms/Passes.td

@@ ... @@ Option<"vectorLength", "vl", "int32_t", "0",
            "Set the vector length (use 0 to disable vectorization)">,
     Option<"enableVLAVectorization", "enable-vla-vectorization", "bool",
            "false", "Enable vector length agnostic vectorization">,
     Option<"enableSIMDIndex32", "enable-simd-index32", "bool", "false",
            "Enable i32 indexing into vectors (for efficient gather/scatter)">,
   ];
 }

+def SparseGPUCodegen : Pass<"sparse-gpu-codegen", "ModuleOp"> {
+  let summary = "Generates GPU code during sparsification";
+  let description = [{
+    Enables sparse compiler to use GPU acceleration.
+  }];
+  let constructor = "mlir::createSparseGPUCodegenPass()";
+  let dependentDialects = [
+    "arith::ArithDialect",
+    "bufferization::BufferizationDialect",
+    "gpu::GPUDialect",
+    "linalg::LinalgDialect",
+    "memref::MemRefDialect",
+    "scf::SCFDialect",
+    "sparse_tensor::SparseTensorDialect",
+  ];
+  let options = [
+    Option<"numThreads", "num_threads", "int32_t", "1024", "Sets the number of GPU threads">,
+  ];
+}
+
 def StorageSpecifierToLLVM : Pass<"sparse-storage-specifier-to-llvm", "ModuleOp"> {
   let summary = "Lower sparse storage specifer to llvm structure";
   let description = [{
     This pass rewrites sparse tensor storage specifier-related operations into
     LLVMDialect, and converts sparse tensor storage specifier into an llvm.struct.

     Example of the conversion:
     ```mlir

mlir/lib/Dialect/SparseTensor/Transforms/CMakeLists.txt

 add_mlir_dialect_library(MLIRSparseTensorTransforms
   BufferizableOpInterfaceImpl.cpp
   CodegenEnv.cpp
   CodegenUtils.cpp
   LoopEmitter.cpp
   SparseBufferRewriting.cpp
+  SparseGPUCodegen.cpp
   SparseStorageSpecifierToLLVM.cpp
   SparseTensorCodegen.cpp
   SparseTensorConversion.cpp
   SparseTensorPasses.cpp
   SparseTensorRewriting.cpp
   SparseTensorStorageLayout.cpp
   SparseVectorization.cpp
   Sparsification.cpp

mlir/lib/Dialect/SparseTensor/Transforms/SparseGPUCodegen.cpp

This file was added.

//===- SparseGPUCodegen.cpp - Generates GPU code (using CUDA) -------------===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//
//
// This is a prototype GPU code generator for the sparse compiler.
// The objective is to eventually use the right combination of
// direct code generation and library calls into vendor-specific
// highly optimized sparse libraries (e.g. cuSparse for CUDA).
//
//===----------------------------------------------------------------------===//

#include "CodegenUtils.h"
#include "LoopEmitter.h"

#include "mlir/Dialect/Bufferization/IR/Bufferization.h"
#include "mlir/Dialect/GPU/IR/GPUDialect.h"
#include "mlir/Dialect/MemRef/IR/MemRef.h"
#include "mlir/Dialect/SCF/IR/SCF.h"
#include "mlir/Dialect/SparseTensor/IR/SparseTensor.h"
#include "mlir/Dialect/SparseTensor/Transforms/Passes.h"
#include "mlir/IR/IRMapping.h"
#include "mlir/IR/Matchers.h"

using namespace mlir;
using namespace mlir::sparse_tensor;

namespace {

//===----------------------------------------------------------------------===//
// Helper methods.
//===----------------------------------------------------------------------===//

/// Marks the given top module as a GPU container module.
static void markAsGPUContainer(ModuleOp topModule) {
  topModule->setAttr(gpu::GPUDialect::getContainerModuleAttrName(),
                     UnitAttr::get(topModule->getContext()));
}

/// Constructs a new GPU module (for GPU kernels) inside the given top module.
static gpu::GPUModuleOp genGPUModule(OpBuilder &builder, ModuleOp topModule,
                                     StringRef name) {
  markAsGPUContainer(topModule);
  builder.setInsertionPointToStart(&topModule.getBodyRegion().front());
  return builder.create<gpu::GPUModuleOp>(topModule->getLoc(), name);
}

/// Constructs a new GPU kernel in the given GPU module.
static gpu::GPUFuncOp genGPUFunc(OpBuilder &builder, gpu::GPUModuleOp gpuModule,
                                 StringRef name, SmallVectorImpl<Value> &args) {
  builder.setInsertionPointToStart(&gpuModule.getBodyRegion().front());
  SmallVector<Type> argsTp;
  for (unsigned i = 0, e = args.size(); i < e; i++)
    argsTp.push_back(args[i].getType());
  FunctionType type = FunctionType::get(gpuModule->getContext(), argsTp, {});
  auto gpuFunc =
      builder.create<gpu::GPUFuncOp>(gpuModule->getLoc(), name, type);
  gpuFunc->setAttr(gpu::GPUDialect::getKernelFuncAttrName(),
                   builder.getUnitAttr());
  return gpuFunc;
}

/// Constructs code to launch GPU kernel.
static void genLaunchGPUFunc(OpBuilder &builder, gpu::GPUFuncOp gpuFunc,
                             SmallVectorImpl<Value> &args,
                             unsigned numThreads) {
  Location loc = gpuFunc->getLoc();
  Value none = TypedValue<::mlir::IntegerType>{};
  Value one = constantIndex(builder, loc, 1);
  Value numT = constantIndex(builder, loc, numThreads);
  gpu::KernelDim3 gridSize = {one, one, one};
  gpu::KernelDim3 blckSize = {numT, one, one};
  builder.create<gpu::LaunchFuncOp>(loc, gpuFunc, gridSize, blckSize,
                                    /*dynSharedMemSz*/ none, args);
}

/// Maps the provided ranked host buffer into the device address space.
/// Writes from the host are guaranteed to be visible to device kernels
/// that are launched afterwards. Writes from the device are guaranteed
/// to be visible on the host after synchronizing with the device kernel
/// completion.
static Value genHostRegisterMemref(OpBuilder &builder, Location loc,
                                   Value mem) {
  MemRefType memTp = mem.getType().cast<MemRefType>();
  UnrankedMemRefType resTp =
      UnrankedMemRefType::get(memTp.getElementType(), /*memorySpace=*/0);
  Value cast = builder.create<memref::CastOp>(loc, resTp, mem);
  builder.create<gpu::HostRegisterOp>(loc, cast);
  return mem; // convenience pass-through
}

/// Constructs code for new GPU kernel.
static void genGPUCode(PatternRewriter &rewriter, gpu::GPUFuncOp gpuFunc,
                       scf::ParallelOp forallOp,
                       SmallVectorImpl<Value> &constants,
                       SmallVectorImpl<Value> &scalars,
                       SmallVectorImpl<Value> &buffers) {
  Location loc = gpuFunc->getLoc();
  Block &block = gpuFunc.getBody().front();
  rewriter.setInsertionPointToStart(&block);

  // Re-generate the constants, recapture all arguments.
  unsigned arg = 0;
  IRMapping irMap;
  for (Value c : constants)
    irMap.map(c, rewriter.clone(*c.getDefiningOp())->getResult(0));
  for (Value s : scalars)
    irMap.map(s, block.getArgument(arg++));
  for (Value b : buffers)
    irMap.map(b, block.getArgument(arg++));

  // Assume 1-dimensional grid/block configuration (only x dimension),
  // so that:
  //   row = blockIdx.x * blockDim.x + threadIdx.x
  //   inc = blockDim.x * gridDim.x
  Value bid = rewriter.create<gpu::BlockIdOp>(loc, gpu::Dimension::x);
  Value bsz = rewriter.create<gpu::BlockDimOp>(loc, gpu::Dimension::x);
  Value tid = rewriter.create<gpu::ThreadIdOp>(loc, gpu::Dimension::x);
  Value gsz = rewriter.create<gpu::GridDimOp>(loc, gpu::Dimension::x);
  Value mul = rewriter.create<arith::MulIOp>(loc, bid, bsz);
  Value row = rewriter.create<arith::AddIOp>(loc, mul, tid);
  Value inc = rewriter.create<arith::MulIOp>(loc, bsz, gsz);

  // Construct the iteration over the computational space that
  // accounts for the fact that the total number of threads and
  // the amount of work to be done usually do not match precisely.
  //   for (r = row; r < N; r += inc) {
  //     <loop-body>
  //   }
  Value upper = irMap.lookup(forallOp.getUpperBound()[0]);
  scf::ForOp forOp = rewriter.create<scf::ForOp>(loc, row, upper, inc);
  rewriter.cloneRegionBefore(forallOp.getLoopBody(), forOp.getLoopBody(),
                             forOp.getLoopBody().begin(), irMap);

  // Done.
  rewriter.setInsertionPointAfter(forOp);
  rewriter.create<gpu::ReturnOp>(gpuFunc->getLoc());
}

//===----------------------------------------------------------------------===//
// Rewriting rules.
//===----------------------------------------------------------------------===//

/// Proof-of-concept rewriter. This rule generates a CUDA implementation
/// for each outermost forall loop generated by the sparse compiler.
//
// TODO: right now works with parallelization-strategy=dense-outer-loop
//       but give this its own flags in the future
//
struct ForallRewriter : public OpRewritePattern<scf::ParallelOp> {
  using OpRewritePattern<scf::ParallelOp>::OpRewritePattern;

  ForallRewriter(MLIRContext *context, unsigned nT)
      : OpRewritePattern(context), numThreads(nT){};

  LogicalResult matchAndRewrite(scf::ParallelOp forallOp,
                                PatternRewriter &rewriter) const override {
    // Reject inadmissible loop form.
    // Essentially only accept a loop, generated by the sparse compiler,
    // of the form
    //   forall (i = 0; i < N; i++)
    // so that cyclic scheduling over the threads is easy.
    if (!forallOp->hasAttr(LoopEmitter::getLoopEmitterLoopAttrName()) ||
        forallOp.getNumReductions() != 0 || forallOp.getNumLoops() != 1 ||
        !matchPattern(forallOp.getLowerBound()[0], m_Zero()) ||
        !matchPattern(forallOp.getStep()[0], m_One()))
      return failure();
    // Collect every value that is computed outside the parallel loop.
    SetVector<Value> invariants; // stable iteration!
    forallOp->walk([&](Operation *op) {
      // Collect all values of admissible ops.
      for (OpOperand &o : op->getOpOperands()) {
        Value val = o.get();
        Block *block;
        if (auto arg = val.dyn_cast<BlockArgument>())
          block = arg.getOwner();
        else
          block = val.getDefiningOp()->getBlock();
        if (!isNestedIn(block, forallOp))
          invariants.insert(val);
      }
    });
    // Outline the outside values as proper parameters. Fail when sharing
    // value between host and device is not straightforward.
    SmallVector<Value> constants;
    SmallVector<Value> scalars;
    SmallVector<Value> buffers;
    for (Value val : invariants) {
      Type tp = val.getType();
      if (val.getDefiningOp<arith::ConstantOp>())
        constants.push_back(val);
      else if (tp.isa<FloatType>() || tp.isIntOrIndex())
        scalars.push_back(val);
      else if (isa<MemRefType>(tp))
        buffers.push_back(val);
      else
        return failure(); // don't know how to share
    }
    // Prepare the outlined arguments, register buffers.
    Location loc = forallOp->getLoc();
    SmallVector<Value> args;
    for (Value s : scalars)
      args.push_back(s);
    for (Value b : buffers)
      args.push_back(genHostRegisterMemref(rewriter, loc, b));
    auto saveIp = rewriter.saveInsertionPoint();
    // Set up GPU module and construct GPU function.
    //
    // TODO: only generate once, avoid name conflict
    //
    ModuleOp topModule = forallOp->getParentOfType<ModuleOp>();
    auto gpuModule = genGPUModule(rewriter, topModule, "sparsekernels");
    auto gpuFunc = genGPUFunc(rewriter, gpuModule, "kernel", args);
    genGPUCode(rewriter, gpuFunc, forallOp, constants, scalars, buffers);
    // Generate code that launches the kernel.
    rewriter.restoreInsertionPoint(saveIp);
    genLaunchGPUFunc(rewriter, gpuFunc, args, numThreads);
    rewriter.eraseOp(forallOp);
    return success();
  }

private:
  // Helper method to see if block appears in given loop.
  static bool isNestedIn(Block *block, scf::ParallelOp forallOp) {
    for (Operation *o = block->getParentOp(); o; o = o->getParentOp()) {
      if (o == forallOp)
        return true;
    }
    return false;
  }

  unsigned numThreads;
};

} // namespace

//===----------------------------------------------------------------------===//
// Public method for populating GPU rewriting rules.
//===----------------------------------------------------------------------===//

void mlir::populateSparseGPUCodegenPatterns(RewritePatternSet &patterns,
                                            unsigned numThreads) {
  patterns.add<ForallRewriter>(patterns.getContext(), numThreads);
}
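
The index arithmetic that genGPUCode emits (row = blockIdx.x * blockDim.x + threadIdx.x, inc = blockDim.x * gridDim.x) can be checked on the CPU with a small simulation. The sketch below is plain C++, not MLIR; `simulateGridStride` is a hypothetical name, and the function only counts how often each iteration is visited for a given grid and block size.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU simulation of the index computation described in genGPUCode:
//   row = blockIdx.x * blockDim.x + threadIdx.x
//   inc = blockDim.x * gridDim.x
//   for (r = row; r < n; r += inc) <loop-body>
// Returns how many times each iteration 0..n-1 was executed.
std::vector<int> simulateGridStride(std::size_t n, std::size_t gridDim,
                                    std::size_t blockDim) {
  assert(gridDim > 0 && blockDim > 0);
  std::vector<int> visits(n, 0);
  const std::size_t inc = blockDim * gridDim;
  for (std::size_t bid = 0; bid < gridDim; ++bid)
    for (std::size_t tid = 0; tid < blockDim; ++tid)
      for (std::size_t r = bid * blockDim + tid; r < n; r += inc)
        ++visits[r];
  return visits;
}
```

Each iteration is visited exactly once whether there are more threads than iterations or fewer, which is the property the strided loop is meant to provide. The launch in this revision uses a grid of 1 block with numThreads threads, so there inc equals the block size.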

mlir/lib/Dialect/SparseTensor/Transforms/SparseTensorPasses.cpp

 //===- SparseTensorPasses.cpp - Pass for autogen sparse tensor code -------===//
 //
 // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
 // See https://llvm.org/LICENSE.txt for license information.
 // SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
 //
 //===----------------------------------------------------------------------===//

 #include "mlir/Dialect/Affine/IR/AffineOps.h"
 #include "mlir/Dialect/Arith/IR/Arith.h"
 #include "mlir/Dialect/Bufferization/IR/Bufferization.h"
 #include "mlir/Dialect/Complex/IR/Complex.h"
 #include "mlir/Dialect/Func/IR/FuncOps.h"
 #include "mlir/Dialect/Func/Transforms/FuncConversions.h"
+#include "mlir/Dialect/GPU/IR/GPUDialect.h"
 #include "mlir/Dialect/LLVMIR/LLVMDialect.h"
 #include "mlir/Dialect/Linalg/Transforms/Transforms.h"
 #include "mlir/Dialect/SCF/Transforms/Transforms.h"
 #include "mlir/Dialect/SparseTensor/IR/SparseTensor.h"
 #include "mlir/Dialect/SparseTensor/Transforms/Passes.h"
 #include "mlir/Dialect/Tensor/IR/Tensor.h"
 #include "mlir/Transforms/GreedyPatternRewriteDriver.h"

 namespace mlir {
 #define GEN_PASS_DEF_PRESPARSIFICATIONREWRITE
 #define GEN_PASS_DEF_SPARSIFICATIONPASS
 #define GEN_PASS_DEF_POSTSPARSIFICATIONREWRITE
 #define GEN_PASS_DEF_SPARSETENSORCONVERSIONPASS
 #define GEN_PASS_DEF_SPARSETENSORCODEGEN
 #define GEN_PASS_DEF_SPARSEBUFFERREWRITE
 #define GEN_PASS_DEF_SPARSEVECTORIZATION
+#define GEN_PASS_DEF_SPARSEGPUCODEGEN
 #define GEN_PASS_DEF_STORAGESPECIFIERTOLLVM
 #include "mlir/Dialect/SparseTensor/Transforms/Passes.h.inc"
 } // namespace mlir

 using namespace mlir;
 using namespace mlir::sparse_tensor;

 namespace {
@@ ... @@ void runOnOperation() override {
     RewritePatternSet patterns(ctx);
     populateSparseVectorizationPatterns(
         patterns, vectorLength, enableVLAVectorization, enableSIMDIndex32);
     vector::populateVectorToVectorCanonicalizationPatterns(patterns);
     (void)applyPatternsAndFoldGreedily(getOperation(), std::move(patterns));
   }
 };

+struct SparseGPUCodegenPass
+    : public impl::SparseGPUCodegenBase<SparseGPUCodegenPass> {
+
+  SparseGPUCodegenPass() = default;
+  SparseGPUCodegenPass(const SparseGPUCodegenPass &pass) = default;
+  SparseGPUCodegenPass(unsigned nT) { numThreads = nT; }
+
+  void runOnOperation() override {
+    auto *ctx = &getContext();
+    RewritePatternSet patterns(ctx);
+    populateSparseGPUCodegenPatterns(patterns, numThreads);
+    (void)applyPatternsAndFoldGreedily(getOperation(), std::move(patterns));
+  }
+};
+
 struct StorageSpecifierToLLVMPass
     : public impl::StorageSpecifierToLLVMBase<StorageSpecifierToLLVMPass> {

   StorageSpecifierToLLVMPass() = default;

   void runOnOperation() override {
     auto *ctx = &getContext();
     ConversionTarget target(*ctx);
@@ ... @@
 std::unique_ptr<Pass>
 mlir::createSparseVectorizationPass(unsigned vectorLength,
                                     bool enableVLAVectorization,
                                     bool enableSIMDIndex32) {
   return std::make_unique<SparseVectorizationPass>(
       vectorLength, enableVLAVectorization, enableSIMDIndex32);
 }

+std::unique_ptr<Pass> mlir::createSparseGPUCodegenPass() {
+  return std::make_unique<SparseGPUCodegenPass>();
+}
+
+std::unique_ptr<Pass> mlir::createSparseGPUCodegenPass(unsigned numThreads) {
+  return std::make_unique<SparseGPUCodegenPass>(numThreads);
+}
+
 std::unique_ptr<Pass> mlir::createStorageSpecifierToLLVMPass() {
   return std::make_unique<StorageSpecifierToLLVMPass>();
 }

mlir/test/Dialect/SparseTensor/GPU/gpu_matmul.mlir

This file was added.

				// RUN: mlir-opt %s --linalg-generalize-named-ops \
				// RUN: --pre-sparsification-rewrite \
				// RUN: --sparsification="parallelization-strategy=dense-outer-loop" \
				// RUN: --sparse-gpu-codegen | FileCheck %s

				#CSR = #sparse_tensor.encoding<{ dimLevelType = [ "dense", "compressed" ] }>

				//
				// Compute matrix matrix C = AB
				//
				// CHECK-LABEL: gpu.func @kernel(
				// CHECK-SAME: %[[VAL_0:.*0]]: index,
				// CHECK-SAME: %[[VAL_1:.*1]]: index,
				// CHECK-SAME: %[[VAL_2:.*2]]: memref<?xindex>,
				// CHECK-SAME: %[[VAL_3:.*3]]: memref<?xindex>,
				// CHECK-SAME: %[[VAL_4:.*4]]: memref<?xf64>,
				// CHECK-SAME: %[[VAL_5:.*5]]: memref<?x?xf64>,
				// CHECK-SAME: %[[VAL_6:.*6]]: memref<?x?xf64>) kernel {
				// CHECK: %[[VAL_7:.*]] = arith.constant 1 : index
				// CHECK: %[[VAL_8:.*]] = arith.constant 0 : index
				// CHECK: %[[VAL_9:.*]] = gpu.block_id x
				// CHECK: %[[VAL_10:.*]] = gpu.block_dim x
				// CHECK: %[[VAL_11:.*]] = gpu.thread_id x
				// CHECK: %[[VAL_12:.*]] = gpu.grid_dim x
				// CHECK: %[[VAL_13:.*]] = arith.muli %[[VAL_9]], %[[VAL_10]] : index
				// CHECK: %[[VAL_14:.*]] = arith.addi %[[VAL_13]], %[[VAL_11]] : index
				// CHECK: %[[VAL_15:.*]] = arith.muli %[[VAL_10]], %[[VAL_12]] : index
				// CHECK: scf.for %[[VAL_16:.*]] = %[[VAL_14]] to %[[VAL_1]] step %[[VAL_15]] {
				// CHECK: %[[VAL_17:.*]] = memref.load %[[VAL_2]]{{\[}}%[[VAL_16]]] : memref<?xindex>
				// CHECK: %[[VAL_18:.*]] = arith.addi %[[VAL_16]], %[[VAL_7]] : index
				// CHECK: %[[VAL_19:.*]] = memref.load %[[VAL_2]]{{\[}}%[[VAL_18]]] : memref<?xindex>
				// CHECK: scf.for %[[VAL_20:.*]] = %[[VAL_17]] to %[[VAL_19]] step %[[VAL_7]] {
				// CHECK: %[[VAL_21:.*]] = memref.load %[[VAL_3]]{{\[}}%[[VAL_20]]] : memref<?xindex>
				// CHECK: %[[VAL_22:.*]] = memref.load %[[VAL_4]]{{\[}}%[[VAL_20]]] : memref<?xf64>
				// CHECK: scf.for %[[VAL_23:.*]] = %[[VAL_8]] to %[[VAL_0]] step %[[VAL_7]] {
				// CHECK: %[[VAL_24:.*]] = memref.load %[[VAL_5]]{{\[}}%[[VAL_16]], %[[VAL_23]]] : memref<?x?xf64>
				// CHECK: %[[VAL_25:.*]] = memref.load %[[VAL_6]]{{\[}}%[[VAL_21]], %[[VAL_23]]] : memref<?x?xf64>
				// CHECK: %[[VAL_26:.*]] = arith.mulf %[[VAL_22]], %[[VAL_25]] : f64
				// CHECK: %[[VAL_27:.*]] = arith.addf %[[VAL_24]], %[[VAL_26]] : f64
				// CHECK: memref.store %[[VAL_27]], %[[VAL_5]]{{\[}}%[[VAL_16]], %[[VAL_23]]] : memref<?x?xf64>
				// CHECK: } {"Emitted from" = "linalg.generic"}
				// CHECK: } {"Emitted from" = "linalg.generic"}
				// CHECK: }
				// CHECK: gpu.return
				// CHECK: }
				//
				//
				// CHECK-LABEL: func.func @matmul
				// CHECK: gpu.host_register
				// CHECK: gpu.host_register
				// CHECK: gpu.host_register
				// CHECK: gpu.host_register
				// CHECK: gpu.host_register
				// CHECK: gpu.launch_func @sparsekernels::@kernel blocks
				//
				func.func @matmul(%A: tensor<?x?xf64, #CSR>, %B: tensor<?x?xf64>, %C_in: tensor<?x?xf64>) -> tensor<?x?xf64> {
				%C_out = linalg.matmul
				ins(%A, %B: tensor<?x?xf64, #CSR>, tensor<?x?xf64>)
				outs(%C_in: tensor<?x?xf64>) -> tensor<?x?xf64>
				return %C_out : tensor<?x?xf64>
				}
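
For readers less used to FileCheck patterns, the loop nest that the CHECK lines of gpu_matmul.mlir describe can be written out in plain C++ for a single simulated thread. This is a reading aid under stated assumptions, not generated code: `matmulThread` and its parameter names are hypothetical, matrices are assumed row-major, and `first`/`inc` stand for the start index and stride computed from the block and thread ids.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Plain C++ rendering of the checked loop nest, for one simulated thread.
//   pos/crd/val : CSR storage of A (m rows)
//   b           : dense B with n columns, row-major
//   c           : dense C, m x n, row-major, updated in place (C += A * B)
void matmulThread(std::size_t first, std::size_t inc, std::size_t m,
                  std::size_t n, const std::vector<std::size_t> &pos,
                  const std::vector<std::size_t> &crd,
                  const std::vector<double> &val,
                  const std::vector<double> &b, std::vector<double> &c) {
  assert(inc > 0 && pos.size() == m + 1 && c.size() == m * n);
  // Outer loop: rows assigned to this thread (strided).
  for (std::size_t i = first; i < m; i += inc) {
    // Middle loop: stored entries of row i of A.
    for (std::size_t jj = pos[i]; jj < pos[i + 1]; ++jj) {
      const std::size_t k = crd[jj];
      const double aik = val[jj];
      // Inner loop: all columns of the dense operands.
      for (std::size_t j = 0; j < n; ++j)
        c[i * n + j] += aik * b[k * n + j];
    }
  }
}
```

Because each thread owns whole rows of C, no two threads write the same element, which is why the outer loop can be parallelized without a reduction.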

mlir/test/Dialect/SparseTensor/GPU/gpu_matvec.mlir

This file was added.

				// RUN: mlir-opt %s --linalg-generalize-named-ops \
				// RUN: --pre-sparsification-rewrite \
				// RUN: --sparsification="parallelization-strategy=dense-outer-loop" \
				// RUN: --sparse-gpu-codegen | FileCheck %s

				#CSR = #sparse_tensor.encoding<{ dimLevelType = [ "dense", "compressed" ] }>

				//
				// Compute matrix vector y = Ax
				//
				//
				// CHECK: gpu.func @kernel(
				// CHECK-SAME: %[[VAL_0:.*0]]: index,
				// CHECK-SAME: %[[VAL_1:.*1]]: memref<?xf64>,
				// CHECK-SAME: %[[VAL_2:.*2]]: memref<?xindex>,
				// CHECK-SAME: %[[VAL_3:.*3]]: memref<?xindex>,
				// CHECK-SAME: %[[VAL_4:.*4]]: memref<?xf64>,
				// CHECK-SAME: %[[VAL_5:.*5]]: memref<?xf64>) kernel {
				// CHECK: %[[VAL_6:.*]] = arith.constant 1 : index
				// CHECK: %[[VAL_7:.*]] = gpu.block_id x
				// CHECK: %[[VAL_8:.*]] = gpu.block_dim x
				// CHECK: %[[VAL_9:.*]] = gpu.thread_id x
				// CHECK: %[[VAL_10:.*]] = gpu.grid_dim x
				// CHECK: %[[VAL_11:.*]] = arith.muli %[[VAL_7]], %[[VAL_8]] : index
				// CHECK: %[[VAL_12:.*]] = arith.addi %[[VAL_11]], %[[VAL_9]] : index
				// CHECK: %[[VAL_13:.*]] = arith.muli %[[VAL_8]], %[[VAL_10]] : index
				// CHECK: scf.for %[[VAL_14:.*]] = %[[VAL_12]] to %[[VAL_0]] step %[[VAL_13]] {
				// CHECK: %[[VAL_15:.*]] = memref.load %[[VAL_1]]{{\[}}%[[VAL_14]]] : memref<?xf64>
				// CHECK: %[[VAL_16:.*]] = memref.load %[[VAL_2]]{{\[}}%[[VAL_14]]] : memref<?xindex>
				// CHECK: %[[VAL_17:.*]] = arith.addi %[[VAL_14]], %[[VAL_6]] : index
				// CHECK: %[[VAL_18:.*]] = memref.load %[[VAL_2]]{{\[}}%[[VAL_17]]] : memref<?xindex>
				// CHECK: %[[VAL_19:.*]] = scf.for %[[VAL_20:.*]] = %[[VAL_16]] to %[[VAL_18]] step %[[VAL_6]] iter_args(%[[VAL_21:.*]] = %[[VAL_15]]) -> (f64) {
				// CHECK: %[[VAL_22:.*]] = memref.load %[[VAL_3]]{{\[}}%[[VAL_20]]] : memref<?xindex>
				// CHECK: %[[VAL_23:.*]] = memref.load %[[VAL_4]]{{\[}}%[[VAL_20]]] : memref<?xf64>
				// CHECK: %[[VAL_24:.*]] = memref.load %[[VAL_5]]{{\[}}%[[VAL_22]]] : memref<?xf64>
				// CHECK: %[[VAL_25:.*]] = arith.mulf %[[VAL_23]], %[[VAL_24]] : f64
				// CHECK: %[[VAL_26:.*]] = arith.addf %[[VAL_21]], %[[VAL_25]] : f64
				// CHECK: scf.yield %[[VAL_26]] : f64
				// CHECK: } {"Emitted from" = "linalg.generic"}
				// CHECK: memref.store %[[VAL_27:.*]], %[[VAL_1]]{{\[}}%[[VAL_14]]] : memref<?xf64>
				// CHECK: }
				// CHECK: gpu.return
				// CHECK: }
				//
				// CHECK-LABEL: func.func @matvec
				// CHECK: gpu.host_register
				// CHECK: gpu.host_register
				// CHECK: gpu.host_register
				// CHECK: gpu.host_register
				// CHECK: gpu.host_register
				// CHECK: gpu.launch_func @sparsekernels::@kernel blocks
				//
				func.func @matvec(%A: tensor<?x?xf64, #CSR>, %x: tensor<?xf64>, %y_in: tensor<?xf64>) -> tensor<?xf64> {
				%y_out = linalg.matvec
				ins(%A, %x: tensor<?x?xf64, #CSR>, tensor<?xf64>)
				outs(%y_in: tensor<?xf64>) -> tensor<?xf64>
				return %y_out : tensor<?xf64>
				}

utils/bazel/llvm-project-overlay/mlir/BUILD.bazel

This file is larger than 256 KB, so syntax highlighting is disabled by default.

@@ ... @@ deps = [
         ":ArithDialect",
         ":ArithUtils",
         ":BufferizationDialect",
         ":BufferizationTransforms",
         ":ComplexDialect",
         ":DialectUtils",
         ":FuncDialect",
         ":FuncTransforms",
+        ":GPUDialect",
         ":IR",
         ":LLVMCommonConversion",
         ":LLVMDialect",
         ":LinalgDialect",
         ":LinalgTransforms",
         ":LinalgUtils",
         ":MathDialect",
         ":MemRefDialect",
