This is an archive of the discontinued LLVM Phabricator instance.

[mlir][gpu] Improving Cubin Serialization with ptxas Compiler
ClosedPublic

Authored by guraypp on Jul 18 2023, 1:07 AM.

Details

Summary

This work improves how we compile the generated PTX code using the ptxas compiler. Currently, we rely on the driver's JIT API to compile the PTX code. However, this approach has some limitations: it does not always produce the same binary output as the ptxas compiler, leading to potential inconsistencies in the generated cubin files.

This work introduces a significant improvement by directly utilizing the ptxas compiler for PTX compilation. By doing so, we can achieve more consistent and reliable results in generating cubin files (a rough sketch of such an invocation is shown after the list below). Key Benefits:

  • Using the ptxas compiler directly ensures that the cubin files generated during the build process remain consistent with CUDA compilation using nvcc or clang.
  • Another advantage of this work is that it allows developers to experiment with different ptxas versions without changing the rest of the toolchain. Performance varies among ptxas versions, so one can easily try a different ptxas compiler.
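For illustration, here is a minimal sketch of what invoking ptxas directly can look like from the serialization pass, using LLVM's temporary-file and process utilities. The helper name, the -arch flag value, and the omission of error handling are assumptions made for brevity; this is not the patch's actual code.

    #include "llvm/ADT/SmallString.h"
    #include "llvm/ADT/SmallVector.h"
    #include "llvm/ADT/Twine.h"
    #include "llvm/Support/FileSystem.h"
    #include "llvm/Support/MemoryBuffer.h"
    #include "llvm/Support/Program.h"
    #include "llvm/Support/raw_ostream.h"

    // Hypothetical helper: write the PTX to a temporary file, run ptxas on it,
    // and read the resulting cubin back into memory. Error handling is elided.
    static std::unique_ptr<llvm::MemoryBuffer>
    compilePtxWithPtxas(llvm::StringRef ptx, llvm::StringRef ptxasPath,
                        llvm::StringRef arch /*e.g. "sm_90a"*/) {
      llvm::SmallString<64> ptxFile, cubinFile;
      int ptxFd;
      llvm::sys::fs::createTemporaryFile("mlir-gpu", "ptx", ptxFd, ptxFile);
      llvm::sys::fs::createTemporaryFile("mlir-gpu", "cubin", cubinFile);
      {
        llvm::raw_fd_ostream os(ptxFd, /*shouldClose=*/true);
        os << ptx;
      }

      // Equivalent to: ptxas -arch=<arch> <input.ptx> -o <output.cubin>
      std::string archFlag = ("-arch=" + arch).str();
      llvm::SmallVector<llvm::StringRef, 5> args = {ptxasPath, archFlag,
                                                    ptxFile, "-o", cubinFile};
      if (llvm::sys::ExecuteAndWait(ptxasPath, args) != 0)
        return nullptr;

      auto cubinOrErr = llvm::MemoryBuffer::getFile(cubinFile);
      llvm::sys::fs::remove(ptxFile);
      llvm::sys::fs::remove(cubinFile);
      if (!cubinOrErr)
        return nullptr;
      return std::move(*cubinOrErr);
    }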

Diff Detail

Event Timeline

guraypp created this revision. Jul 18 2023, 1:07 AM
guraypp requested review of this revision. Jul 18 2023, 1:07 AM
Herald added a project: Restricted Project.
mlir/include/mlir/Dialect/GPU/Transforms/Passes.h
176

this now has enough magic flags that you may want an options struct with named fields to hold the information.

guraypp added inline comments. Jul 19 2023, 12:06 AM
mlir/include/mlir/Dialect/GPU/Transforms/Passes.h
176

Right, we have many flags here, and we might even have more since PTX compilation is done by another compiler.

What does the options struct look like? Do we have an example?

mlir/include/mlir/Dialect/GPU/Transforms/Passes.h
176

Yes, quite a few actually.

It depends on how much you want to be configurable with CLI options.

You could look at ConvertFuncToLLVMPass, which has a let options = declaration, and at how it is used in https://reviews.llvm.org/D155463.

Alternatively, you could also create your own struct manually; there are examples in Dialect/Linalg/Transforms/Transforms.h (a rough sketch of such a struct follows below).

Approving, conditioned on a better options struct to avoid proliferation of flags.
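For concreteness, a rough sketch of what a manually defined options struct could look like, in the spirit of the examples in Dialect/Linalg/Transforms/Transforms.h. The struct and field names below are illustrative assumptions, not the names this patch ultimately uses.

    #include <string>

    // Hypothetical options struct (all names illustrative): groups the pass's
    // "magic flags" behind named fields instead of a long constructor
    // parameter list, so call sites stay readable as more knobs are added.
    struct SerializeToCubinOptions {
      std::string triple = "nvptx64-nvidia-cuda"; // target triple
      std::string chip = "sm_80";                 // GPU architecture
      std::string features = "+ptx76";            // PTX ISA feature string
      int optLevel = 2;                           // optimization level
      bool dumpPtx = false;                       // print the generated PTX
      bool useDriverJit = true;                   // keep the driver's JIT path
      std::string ptxasPath = "ptxas";            // which ptxas binary to run
    };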

mlir/lib/Dialect/GPU/Transforms/SerializeToCubin.cpp
228

spurious format change

This revision is now accepted and ready to land. Jul 20 2023, 11:57 PM
guraypp updated this revision to Diff 542968. Jul 21 2023, 9:32 AM

add parameters for ptxas, and rebase

guraypp updated this revision to Diff 543405. Jul 24 2023, 12:40 AM
guraypp marked an inline comment as done.

use struct for the options

Also, I have a concern with the patch in the first place: the previous plan on Discourse was to use the CUDA toolkit API instead of shelling out to the ptxas binary.

See also the plans for changing the entire GPU-to-LLVM lowering path here: https://reviews.llvm.org/D154153 (series of patches).

A lot of context here: https://discourse.llvm.org/t/rfc-extending-mlir-gpu-device-codegen-pipeline/70199/

Thanks @mehdi_amini for addressing the issue promptly. I made two additional fixes after seeing buildbot failures. I thought they fixed it. Where can I see the latest failures?

Also, I have a concern with the patch in the first place: the previous plan on Discourse was to use the CUDA toolkit API instead of shelling out to the ptxas binary.

I'm happy to discuss it on Discourse, but I thought this work would make everyone happy. Let me explain the need here; if it gets long, we can continue on Discourse.

This work impacts only PTX-to-SASS compilation (not GPU-to-LLVM-to-PTX). It enables using the ptxas compiler rather than the CUDA driver. It's crucial for two reasons:

  1. The flexibility to choose a different ptxas compiler when facing performance regressions. Currently, MLIR uses the CUDA driver for PTX compilation, which limits the ptxas compiler to the underlying driver version, leaving no room for choice.
  2. Ensuring you get the same SASS code that nvcc produces. Interestingly, the SASS produced by the CUDA driver differs from that of ptxas. During my implementation of Hopper's TMA load instruction, I encountered MISALIGNED address issues.

It is not easy to find an answer when you hit one of these problems. SASS is not documented.

The nvcc pipeline is the state of the art for NVIDIA compilation, and it uses ptxas for that. I think we should mimic what nvcc does in MLIR.
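For readers comparing the two paths: the driver-based route goes roughly through the cuLink* JIT API sketched below, whereas this proposal shells out to the standalone ptxas binary instead. This is a simplified illustration that assumes a CUDA context is already current and omits all error checking; it is not the pass's exact code.

    #include <cuda.h>
    #include <vector>

    // Simplified sketch of the existing driver-JIT path: the PTX string is
    // handed to the CUDA driver, which compiles it with whatever ptxas version
    // is baked into the installed driver, so the resulting SASS depends on the
    // driver version. Assumes a valid CUDA context is current.
    static void jitPtxWithDriver(const char *ptx, size_t ptxSize,
                                 std::vector<char> &cubinOut) {
      CUlinkState linkState;
      cuLinkCreate(/*numOptions=*/0, /*options=*/nullptr,
                   /*optionValues=*/nullptr, &linkState);
      cuLinkAddData(linkState, CU_JIT_INPUT_PTX, const_cast<char *>(ptx),
                    ptxSize, /*name=*/"kernel", /*numOptions=*/0, nullptr,
                    nullptr);
      void *cubin = nullptr;
      size_t cubinSize = 0;
      cuLinkComplete(linkState, &cubin, &cubinSize);
      // The cubin buffer is owned by linkState; copy it out before destroying.
      cubinOut.assign(static_cast<char *>(cubin),
                      static_cast<char *>(cubin) + cubinSize);
      cuLinkDestroy(linkState);
    }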

See also the plans for changing the entire GPU-to-LLVM lowering path here: https://reviews.llvm.org/D154153 (series of patches).

A lot of context here: https://discourse.llvm.org/t/rfc-extending-mlir-gpu-device-codegen-pipeline/70199/

That's cool. The current GPU pipeline has some shortcomings. I am very interested in every aspect of GPUs; please add me as a reviewer when you have something new, so I can take a look.

mlir/include/mlir/Dialect/GPU/Transforms/Passes.h
176

I still did not understand what kind of struct I need to use. There is no .td file for this pass, so I cannot do let options.
I think I had better land this and implement the struct in a follow-up PR.

This work impacts only PTX-to-SASS compilation (not GPU-to-LLVM-to-PTX). It enables using the ptxas compiler rather than the CUDA driver. It's crucial for two reasons:
The flexibility to choose a different ptxas compiler when facing performance regressions. Currently, MLIR uses the CUDA driver for PTX compilation, which limits the ptxas compiler to the underlying driver version, leaving no room for choice.
Ensuring you get the same SASS code that nvcc produces. Interestingly, the SASS produced by the CUDA driver differs from that of ptxas. During my implementation of Hopper's TMA load instruction, I encountered MISALIGNED address issues.
It is not easy to find an answer when you hit one of these problems. SASS is not documented.

Sure, but aren't ptxas and nvrtc both just distributed with the CUDA toolkit? From the point of view you mentioned above, what is the difference between using one or the other?

The nvcc pipeline is state-of-art for nvidia compilation, and it uses ptxas for that. I think we should mimic what nvcc is doing in MLIR.

Actually, it does so only if you ask it to, right? Otherwise it embeds PTX?

Sure, but aren't ptxas and nvrtc both just distributed with the CUDA toolkit? From the point of view you mentioned above, what is the difference between using one or the other?

I have extracted the relevant code for you to review. The link below contains the ptxas-generated SASS code on the left side, the driver-generated SASS code on the right side, and the PTX code used for both cases in the box at the bottom. Unfortunately, the program compiled with the driver crashes with a Warp Misaligned Address error. I suspect that the issue may be related to the UTMLDG.2D instruction, but it's hard to say, and I am unable to debug this further. The other program works as expected.
https://godbolt.org/z/6onxv6enz

I'm uncertain about whether ptxas and the CUDA driver's compiler consistently generate the same SASS code, as there has been no confirmation on this matter in the past.

Nevertheless, to maintain our sanity and ensure consistent performance, I strongly advocate selecting ptxas. It's really sad to see a 10% performance drop, even when generating identical PTX code, simply because someone installed a new CUDA driver :(

The nvcc pipeline is the state of the art for NVIDIA compilation, and it uses ptxas for that. I think we should mimic what nvcc does in MLIR.

Actually, it does so only if you ask it to, right? Otherwise it embeds PTX?

Try nvcc -v code.cu; you will see ptxas in the compilation flow.

Additionally, I want to clarify that I am not suggesting the removal of driver compilation. Instead, I propose adding ptxas compilation as an option behind a flag. This way, anyone can choose to opt in or out based on their specific requirements and preferences.
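Purely as an illustration of that opt-in idea, reusing the sketch struct from the earlier inline comment: all names below are hypothetical and do not reflect the committed interface.

    #include "mlir/Pass/PassManager.h"

    // Hypothetical opt-in sketch: the driver-JIT path stays the default and
    // ptxas is selected explicitly at the call site. Names are illustrative.
    void buildSerializationPipeline(mlir::PassManager &pm) {
      SerializeToCubinOptions options;                 // sketch struct from above
      options.useDriverJit = false;                    // opt in to ptxas
      options.ptxasPath = "/usr/local/cuda/bin/ptxas"; // pin a specific ptxas
      pm.addPass(createGpuSerializeToCubinPass(options)); // assumed factory
    }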