This is an archive of the discontinued LLVM Phabricator instance.


Differential D35703

[GPGPU] Add support for NVIDIA libdevice
ClosedPublic

Authored by grosser on Jul 20 2017, 3:44 PM.

Details

Reviewers

bollu
singam-sanjay

Commits

rG8fc6cdfb1cbd: [GPGPU] Add support for NVIDIA libdevice
rPLO309560: [GPGPU] Add support for NVIDIA libdevice
rL309560: [GPGPU] Add support for NVIDIA libdevice

Summary

This allows us to map functions such as exp, expf, expl, for which no
LLVM intrinsics exist. Instead, we link to NVIDIA's libdevice which provides
high-performance implementations of a wide range of (math) functions. We
currently link only a small subset, the exp* and cos functions. Other functions
will be enabled as needed.
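
As a rough illustration of the mapping the summary describes, the sketch below looks a host function name up in the supported subset and, on a hit, returns the libdevice symbol. The `__nv_` prefix and the five names come from this diff; the helper name `libdeviceNameFor` is hypothetical, and plain `std::string` stands in for `llvm::Function` so the snippet compiles on its own.

```cpp
#include <set>
#include <string>

// Subset of functions this revision maps to libdevice (names from the diff).
static const std::set<std::string> LibDeviceFunctions = {
    "exp", "expf", "expl", "cos", "cosf"};

// Hypothetical helper: return the libdevice symbol for Name, or "" when the
// function is not in the supported subset.
std::string libdeviceNameFor(const std::string &Name) {
  if (LibDeviceFunctions.count(Name))
    return "__nv_" + Name;
  return "";
}
```

A call such as `libdeviceNameFor("expf")` yields `__nv_expf`, which matches the symbol the test in this revision checks for in the kernel IR.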

Diff Detail

Build Status

Buildable 8459
Build 8459: arc lint + arc unit

Event Timeline

grosser created this revision. Jul 20 2017, 3:44 PM

Herald added subscribers: kbarton, mgorny, nemanjai. Jul 20 2017, 3:44 PM

bollu added inline comments. Jul 20 2017, 3:54 PM

lib/CodeGen/PPCGCodeGeneration.cpp
1314	Consider changing to `SmallSet`? That feels correct in terms of semantics (a `set` of functions.)
1418	`const bool`? :)
2131	consider moving computing `RequiresLibDevice` to a pure function?
2138	I believe we can `assert` at this point, since this is not a "mis-compile" in the strictest sense of the word?.
2148	nit: `trible` -> `triple`.
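
One way to read the suggestion on line 2131 is to separate the query from the renaming. The sketch below is an assumption about what such a pure function could look like, not the code that landed; `FunctionInfo` and `requiresLibDevice` are hypothetical stand-ins for `llvm::Function` and for the loop in `addLibDevice`.

```cpp
#include <set>
#include <string>
#include <vector>

// Stand-in for the two llvm::Function properties the loop consults.
struct FunctionInfo {
  std::string Name;
  bool IsDeclaration;
};

// Pure query: true if any declaration (a function without a body) names a
// function that libdevice provides. Nothing is renamed or modified here.
bool requiresLibDevice(const std::vector<FunctionInfo> &Functions,
                       const std::set<std::string> &LibDeviceFunctions) {
  for (const FunctionInfo &F : Functions)
    if (F.IsDeclaration && LibDeviceFunctions.count(F.Name))
      return true;
  return false;
}
```

Keeping the check free of side effects means the renaming to `__nv_*` and the linking step can be tested separately from the decision to link at all.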

tra added a subscriber: tra. Jul 20 2017, 3:59 PM

tra added inline comments.

lib/CodeGen/PPCGCodeGeneration.cpp
107–111	This is something that is useful for all NVPTX users and should probably live there and it should not have any hardcoded path -- it's too easy to end up silently picking wrong library otherwise. Hardcoded compute_20 is also problematic because it should depend on particular GPU arch we're compiling for. Considering that LLVM has no idea about CUDA SDK location, this is something that should always be explicitly specified. Either base path + libdevice name derived from GPU arch, or complete path to specific libdevice variant (i.e. it's completely up to the user to provide correct libdevice).
3020	Bllock -> Block
test/GPGPU/libdevice-functions-copied-into-kernel.ll
38–69	Can the test be reduced to just expf() call?
test/GPGPU/libdevice-functions-copied-into-kernel_libdevice.bc
1–6	This file should be under Inputs/ directory (see NVPTX tests for example) and have .ll extension.
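
The first alternative in the comment on lines 107–111 (base path plus a file name derived from the GPU architecture, with no built-in default) might be sketched as follows. This is only an illustration: `buildLibDevicePath` is a hypothetical helper, and the `libdevice.<variant>.10.bc` file name pattern is assumed from the default path in this diff rather than taken from any CUDA documentation.

```cpp
#include <string>

// Hypothetical helper: compose a libdevice path from a user-supplied base
// directory and a compute variant such as "compute_20". Returns false and
// leaves Path untouched when either piece is missing, so the caller has to
// report the problem rather than fall back to a guessed location.
bool buildLibDevicePath(const std::string &BaseDir,
                        const std::string &ComputeVariant, std::string &Path) {
  if (BaseDir.empty() || ComputeVariant.empty())
    return false;
  // File name pattern assumed from the default path used in this diff.
  Path = BaseDir + "/libdevice." + ComputeVariant + ".10.bc";
  return true;
}
```

Returning a failure for an empty base directory is the point of the design: nothing is picked silently, and the diagnostic can name the option the user needs to set.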

bollu added a reviewer: singam-sanjay. Jul 21 2017, 7:06 AM

Is there some way to test this without having libdevice? The tests break on my mac.

bollu added inline comments. Jul 24 2017, 3:18 AM

lib/CodeGen/PPCGCodeGeneration.cpp
1314	could you also please add `sqrt` to the list? this is present in the `COSMO` kernel.

singam-sanjay added inline comments. Jul 24 2017, 8:09 AM

lib/CodeGen/PPCGCodeGeneration.cpp
107	Would it be better to call this CUDALibDevice or CULibDevice instead ? since this applies only to NVPTX
109	Consider changing this to "/usr/local/cuda/nvvm/libdevice/libdevice.compute_20_10.bc". That would work on most Linux platforms by default. I'm not sure if PTX code for a compute capability 2 device would run on any newer device. Is it possible to initialize this after figuring out the CC of device 0 ? Also, I heard that CUDA SDK 8 would be the last to support CC 2.x. CUDA 9 supports all CCs from 3.0.
609	Would addCULibDevice be a better name ?

Address review comments

Thank you for all the good reviews. I tried to address all of them.

Best,
Tobias

lib/CodeGen/PPCGCodeGeneration.cpp
107	I use CUDALibDevice
107–111	@tra: Thank you for your comment. This is the very first commit to introduce this feature. We currently are in early beta tests. The library location is supposed to be provided by the user by setting the path with polly-acc-libdevice. I set the option to a very basic default. For now I expect the user to adjust this default. In the future we can add some generic infrastructure to LLVM to derive this path automatically. If a specific fixed path is too confusing I can also use an empty default and always prompt for a path.
109	Changed. using /usr/local/cuda is indeed a good idea. I would like to start with the oldest library. We can later add support for different library versions. I don't think we can query device 0, as we might compile on different platform as where we run the final code. However, we can make this depend on polly-acc-cuda-version.
609	Changed.
1314	Done.!
1314	I use a std::set. That should be good enough for now.
2131	Done.
2138	I use report_fatal_error as suggested by Michael.
2148	Done!
3020	Fixed in r308715.
test/GPGPU/libdevice-functions-copied-into-kernel.ll
38–69	Unlikely. This is a test for Polly-ACC, where we auto-offload to CUDA. For this we need at least some parallelism, which means some loop.
test/GPGPU/libdevice-functions-copied-into-kernel_libdevice.bc
1–6	Very good idea. I adopted it.

How is this different from passing libdevice to either of the -mlink-cuda-bitcode or -mlink-bitcode-file options?

lib/CodeGen/PPCGCodeGeneration.cpp
2154	Could this be avoided by implementing Triple::isCompatibleWith() for nvptx?

tra added inline comments. Jul 24 2017, 3:51 PM

lib/CodeGen/PPCGCodeGeneration.cpp
107–111	I believe no default would be a better option in this case as it minimizes possibility for things to go wrong silently.
113	The variable appears to be unused. On a side note, please consider that there's already a way to specify GPU variant for NVPTX back-end. This option is either going to be redundant or you'll need a good explanation for what's supposed to happen when its value conflicts with whatever GPU variant NVPTX back-end thinks it's supposed to generate the code for.
118	This also appears to be unused. Please remove them from this patch and re-introduce them in subsequent patches that really need them.
2122	!empty()
2133	What's supposed to happen in this case? Consider adding a diagnostic message.
2138	But why are you still printing the libdevice name on stderr?
2161	I'm curious -- when would it be OK to proceed if verifier has failed? Should this option be other way around -- and make LLVM fail by default on verifier failure? That would be my expectation of normal behavior. One could conceivably use this option for debugging purposes to force LLVM to proceed even when verifier has failed, but in general I believe that by default the errors should be reported as early as possible.
test/GPGPU/libdevice-functions-copied-into-kernel.ll
38–69	It may be worth reconsidering your approach and making this functionality generic to NVPTX so it can benefit all users of NVPTX back-end. The functionality is generic enough to benefit CUDA (and possibly OpenCL) which currently fail miserably if any of standard library functions sneak into IR.
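
The default suggested for line 2161 can be sketched as a small decision function: abort on verifier failure unless a debugging override is set. The names `VerifyAction`, `onVerifyResult` and `ProceedOnVerifyFailure` are hypothetical; the diff itself uses the opposite default through `polly-acc-fail-on-verify-module-failure`.

```cpp
// Hypothetical outcome of checking the GPU kernel module.
enum class VerifyAction { Continue, Abort, ProceedDespiteFailure };

// Fail by default; only proceed past a verifier failure when the (assumed)
// debugging override is set.
VerifyAction onVerifyResult(bool VerifierFailed, bool ProceedOnVerifyFailure) {
  if (!VerifierFailed)
    return VerifyAction::Continue;
  return ProceedOnVerifyFailure ? VerifyAction::ProceedDespiteFailure
                                : VerifyAction::Abort;
}
```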

Other than nits, LGTM

lib/CodeGen/PPCGCodeGeneration.cpp
1314	this can be `const std::set<std::string> ...` ? I do not see us mutating this, and it would be nice to communicate this fact.
test/lit.site.cfg.in
37 ↗	(On Diff #107971)	Can we be more explicit and mention that these are for the `libdevice` tests?

This revision is now accepted and ready to land. Jul 25 2017, 12:50 AM

bollu added inline comments. Jul 28 2017, 5:13 AM

lib/CodeGen/PPCGCodeGeneration.cpp
1314	Could you also add `copysign`, please?

Closed by commit rL309560: [GPGPU] Add support for NVIDIA libdevice (authored by grosser). Jul 31 2017, 7:04 AM

This revision was automatically updated to reflect the committed changes.

grosser marked an inline comment as done.

Revision Contents

Path	Size
lib/CMakeLists.txt	2 lines
lib/CodeGen/PPCGCodeGeneration.cpp	102 lines
test/GPGPU/libdevice-functions-copied-into-kernel.ll	74 lines
test/GPGPU/libdevice-functions-copied-into-kernel_libdevice.bc	6 lines

Diff 107597

lib/CMakeLists.txt

[100 lines not shown]	target_link_libraries(Polly
LLVMCore		LLVMCore
LLVMScalarOpts		LLVMScalarOpts
LLVMInstCombine		LLVMInstCombine
LLVMTransformUtils		LLVMTransformUtils
LLVMAnalysis		LLVMAnalysis
LLVMipo		LLVMipo
LLVMMC		LLVMMC
LLVMPasses		LLVMPasses
		LLVMLinker
		LLVMIRReader
${nvptx_libs}		${nvptx_libs}
# The libraries below are required for darwin: http://PR26392		# The libraries below are required for darwin: http://PR26392
LLVMBitReader		LLVMBitReader
LLVMMCParser		LLVMMCParser
LLVMObject		LLVMObject
LLVMProfileData		LLVMProfileData
LLVMTarget		LLVMTarget
LLVMVectorize		LLVMVectorize
[39 lines not shown]

lib/CodeGen/PPCGCodeGeneration.cpp

[25 lines not shown]
#include "llvm/Analysis/AliasAnalysis.h"		#include "llvm/Analysis/AliasAnalysis.h"
#include "llvm/Analysis/BasicAliasAnalysis.h"		#include "llvm/Analysis/BasicAliasAnalysis.h"
#include "llvm/Analysis/GlobalsModRef.h"		#include "llvm/Analysis/GlobalsModRef.h"
#include "llvm/Analysis/ScalarEvolutionAliasAnalysis.h"		#include "llvm/Analysis/ScalarEvolutionAliasAnalysis.h"
#include "llvm/Analysis/TargetLibraryInfo.h"		#include "llvm/Analysis/TargetLibraryInfo.h"
#include "llvm/Analysis/TargetTransformInfo.h"		#include "llvm/Analysis/TargetTransformInfo.h"
#include "llvm/IR/LegacyPassManager.h"		#include "llvm/IR/LegacyPassManager.h"
#include "llvm/IR/Verifier.h"		#include "llvm/IR/Verifier.h"
		#include "llvm/IRReader/IRReader.h"
		#include "llvm/Linker/Linker.h"
#include "llvm/Support/TargetRegistry.h"		#include "llvm/Support/TargetRegistry.h"
#include "llvm/Support/TargetSelect.h"		#include "llvm/Support/TargetSelect.h"
#include "llvm/Target/TargetMachine.h"		#include "llvm/Target/TargetMachine.h"
#include "llvm/Transforms/IPO/PassManagerBuilder.h"		#include "llvm/Transforms/IPO/PassManagerBuilder.h"
#include "llvm/Transforms/Utils/BasicBlockUtils.h"		#include "llvm/Transforms/Utils/BasicBlockUtils.h"

#include "isl/union_map.h"		#include "isl/union_map.h"

[55 lines not shown]
static cl::opt<bool>		static cl::opt<bool>
FailOnVerifyModuleFailure("polly-acc-fail-on-verify-module-failure",		FailOnVerifyModuleFailure("polly-acc-fail-on-verify-module-failure",
cl::desc("Fail and generate a backtrace if"		cl::desc("Fail and generate a backtrace if"
" verifyModule fails on the GPU "		" verifyModule fails on the GPU "
" kernel module."),		" kernel module."),
cl::Hidden, cl::init(false), cl::ZeroOrMore,		cl::Hidden, cl::init(false), cl::ZeroOrMore,
cl::cat(PollyCategory));		cl::cat(PollyCategory));

		static cl::opt<std::string> LibDevice(
		[inline comment] singam-sanjay: Would it be better to call this CUDALibDevice or CULibDevice instead ? since this applies only to NVPTX
		[inline comment] grosser: I use CUDALibDevice
		"polly-acc-libdevice", cl::desc("Path to CUDA libdevice"), cl::Hidden,
		cl::init("/usr/local/cuda-7.5/nvvm/libdevice/libdevice.compute_20.10.bc"),
		[inline comment] singam-sanjay: Consider changing this to "/usr/local/cuda/nvvm/libdevice/libdevice.compute_20_10.bc". That would work on most Linux platforms by default. I'm not sure if PTX code for a compute capability 2 device would run on any newer device. Is it possible to initialize this after figuring out the CC of device 0 ? Also, I heard that CUDA SDK 8 would be the last to support CC 2.x. CUDA 9 supports all CCs from 3.0.
		[inline comment] grosser: Changed. using /usr/local/cuda is indeed a good idea. I would like to start with the oldest library. We can later add support for different library versions. I don't think we can query device 0, as we might compile on different platform as where we run the final code. However, we can make this depend on polly-acc-cuda-version.
		cl::ZeroOrMore, cl::cat(PollyCategory));

		[inline comment] tra: This is something that is useful for all NVPTX users and should probably live there and it should not have any hardcoded path -- it's too easy to end up silently picking wrong library otherwise. Hardcoded compute_20 is also problematic because it should depend on particular GPU arch we're compiling for. Considering that LLVM has no idea about CUDA SDK location, this is something that should always be explicitly specified. Either base path + libdevice name derived from GPU arch, or complete path to specific libdevice variant (i.e. it's completely up to the user to provide correct libdevice).
		[inline comment] grosser: @tra: Thank you for your comment. This is the very first commit to introduce this feature. We currently are in early beta tests. The library location is supposed to be provided by the user by setting the path with polly-acc-libdevice. I set the option to a very basic default. For now I expect the user to adjust this default. In the future we can add some generic infrastructure to LLVM to derive this path automatically. If a specific fixed path is too confusing I can also use an empty default and always prompt for a path.
		[inline comment] tra: I believe no default would be a better option in this case as it minimizes possibility for things to go wrong silently.
static cl::opt<std::string>		static cl::opt<std::string>
CudaVersion("polly-acc-cuda-version",		CudaVersion("polly-acc-cuda-version",
		[inline comment] tra: The variable appears to be unused. On a side note, please consider that there's already a way to specify GPU variant for NVPTX back-end. This option is either going to be redundant or you'll need a good explanation for what's supposed to happen when its value conflicts with whatever GPU variant NVPTX back-end thinks it's supposed to generate the code for.
cl::desc("The CUDA version to compile for"), cl::Hidden,		cl::desc("The CUDA version to compile for"), cl::Hidden,
cl::init("sm_30"), cl::ZeroOrMore, cl::cat(PollyCategory));		cl::init("sm_30"), cl::ZeroOrMore, cl::cat(PollyCategory));

static cl::opt<int>		static cl::opt<int>
MinCompute("polly-acc-mincompute",		MinCompute("polly-acc-mincompute",
		[inline comment] tra: This also appears to be unused. Please remove them from this patch and re-introduce them in subsequent patches that really need them.
cl::desc("Minimal number of compute statements to run on GPU."),		cl::desc("Minimal number of compute statements to run on GPU."),
cl::Hidden, cl::init(10 * 512 * 512));		cl::Hidden, cl::init(10 * 512 * 512));

/// Used to store information PPCG wants for kills. This information is		/// Used to store information PPCG wants for kills. This information is
/// used by live range reordering.		/// used by live range reordering.
///		///
/// @see computeLiveRangeReordering		/// @see computeLiveRangeReordering
/// @see GPUNodeBuilder::createPPCGScop		/// @see GPUNodeBuilder::createPPCGScop
[473 lines not shown]	private:
/// @param F The function to remove references to.		/// @param F The function to remove references to.
void clearScalarEvolution(Function *F);		void clearScalarEvolution(Function *F);

/// Remove references from loop info to the kernel function @p F.		/// Remove references from loop info to the kernel function @p F.
///		///
/// @param F The function to remove references to.		/// @param F The function to remove references to.
void clearLoops(Function *F);		void clearLoops(Function *F);

		/// Link with the NVIDIA libdevice library (if needed and available).
		void addLibDevice();
		[inline comment] singam-sanjay: Would addCULibDevice be a better name ?
		[inline comment] grosser: Changed.

/// Finalize the generation of the kernel function.		/// Finalize the generation of the kernel function.
///		///
/// Free the LLVM-IR module corresponding to the kernel and -- if requested --		/// Free the LLVM-IR module corresponding to the kernel and -- if requested --
/// dump its IR to stderr.		/// dump its IR to stderr.
///		///
/// @returns The Assembly string of the kernel.		/// @returns The Assembly string of the kernel.
std::string finalizeKernelFunction();		std::string finalizeKernelFunction();

[686 lines not shown]	isl_bool collectReferencesInGPUStmt(__isl_keep isl_ast_node *Node, void *User) {
auto Stmt = (ScopStmt *)KernelStmt->u.d.stmt->stmt;		auto Stmt = (ScopStmt *)KernelStmt->u.d.stmt->stmt;
isl_id_free(Id);		isl_id_free(Id);

addReferencesFromStmt(Stmt, User, false /* CreateScalarRefs */);		addReferencesFromStmt(Stmt, User, false /* CreateScalarRefs */);

return isl_bool_true;		return isl_bool_true;
}		}

		/// A list of functions that are available in NVIDIA's libdevice.
		std::vector<std::string> LibDeviceFunctions = {"exp", "expf", "expl", "cos",
		[inline comment] bollu: Consider changing to `SmallSet`? That feels correct in terms of semantics (a `set` of functions.)
		[inline comment] grosser: I use a std::set. That should be good enough for now.
		[inline comment] bollu: could you also please add `sqrt` to the list? this is present in the `COSMO` kernel.
		[inline comment] grosser: Done.!
		[inline comment] bollu: this can be `const std::set<std::string> ...` ? I do not see us mutating this, and it would be nice to communicate this fact.
		[inline comment] bollu: Could you also add `copysign`, please?
		"cosf"};

		/// Return the corresponding CUDA libdevice function name for @p F.
		///
		/// Return "" if we are not compiling for CUDA.
		std::string getLibDeviceFuntion(Function *F) {
		for (auto Name : LibDeviceFunctions)
		if (Name == F->getName())
		return "__nv_" + Name;

		return "";
		}

/// Check if F is a function that we can code-generate in a GPU kernel.		/// Check if F is a function that we can code-generate in a GPU kernel.
static bool isValidFunctionInKernel(llvm::Function *F) {		static bool isValidFunctionInKernel(llvm::Function *F, bool AllowLibDevice) {
assert(F && "F is an invalid pointer");		assert(F && "F is an invalid pointer");
// We string compare against the name of the function to allow		// We string compare against the name of the function to allow
// all variants of the intrinsic "llvm.sqrt.*", "llvm.fabs", and		// all variants of the intrinsic "llvm.sqrt.*", "llvm.fabs", and
// "llvm.copysign".		// "llvm.copysign".
const StringRef Name = F->getName();		const StringRef Name = F->getName();

		if (AllowLibDevice && getLibDeviceFuntion(F).length() > 0)
		return true;

return F->isIntrinsic() &&		return F->isIntrinsic() &&
(Name.startswith("llvm.sqrt") || Name.startswith("llvm.fabs") ||		(Name.startswith("llvm.sqrt") || Name.startswith("llvm.fabs") ||
Name.startswith("llvm.copysign"));		Name.startswith("llvm.copysign"));
}		}

/// Do not take `Function` as a subtree value.		/// Do not take `Function` as a subtree value.
///		///
/// We try to take the reference of all subtree values and pass them along		/// We try to take the reference of all subtree values and pass them along
/// to the kernel from the host. Taking an address of any function and		/// to the kernel from the host. Taking an address of any function and
/// trying to pass along is nonsensical. Only allow `Value`s that are not		/// trying to pass along is nonsensical. Only allow `Value`s that are not
/// `Function`s.		/// `Function`s.
static bool isValidSubtreeValue(llvm::Value *V) { return !isa<Function>(V); }		static bool isValidSubtreeValue(llvm::Value *V) { return !isa<Function>(V); }

/// Return `Function`s from `RawSubtreeValues`.		/// Return `Function`s from `RawSubtreeValues`.
static SetVector<Function *>		static SetVector<Function *>
getFunctionsFromRawSubtreeValues(SetVector<Value *> RawSubtreeValues) {		getFunctionsFromRawSubtreeValues(SetVector<Value *> RawSubtreeValues,
		bool AllowLibDevice) {
SetVector<Function *> SubtreeFunctions;		SetVector<Function *> SubtreeFunctions;
for (Value *It : RawSubtreeValues) {		for (Value *It : RawSubtreeValues) {
Function *F = dyn_cast<Function>(It);		Function *F = dyn_cast<Function>(It);
if (F) {		if (F) {
assert(isValidFunctionInKernel(F) && "Code should have bailed out by "		assert(isValidFunctionInKernel(F, AllowLibDevice) &&
		"Code should have bailed out by "
"this point if an invalid function "		"this point if an invalid function "
"were present in a kernel.");		"were present in a kernel.");
SubtreeFunctions.insert(F);		SubtreeFunctions.insert(F);
}		}
}		}
return SubtreeFunctions;		return SubtreeFunctions;
}		}

std::pair<SetVector<Value *>, SetVector<Function *>>		std::pair<SetVector<Value *>, SetVector<Function *>>
GPUNodeBuilder::getReferencesInKernel(ppcg_kernel *Kernel) {		GPUNodeBuilder::getReferencesInKernel(ppcg_kernel *Kernel) {
[37 lines not shown]
// SubtreeValues. This is important, because we should not lose any		// SubtreeValues. This is important, because we should not lose any
// SubtreeValues in the process of constructing the		// SubtreeValues in the process of constructing the
// "ValidSubtree{Values, Functions} sets. Nor should the set		// "ValidSubtree{Values, Functions} sets. Nor should the set
// ValidSubtree{Values, Functions} have any common element.		// ValidSubtree{Values, Functions} have any common element.
auto ValidSubtreeValuesIt =		auto ValidSubtreeValuesIt =
make_filter_range(SubtreeValues, isValidSubtreeValue);		make_filter_range(SubtreeValues, isValidSubtreeValue);
SetVector<Value *> ValidSubtreeValues(ValidSubtreeValuesIt.begin(),		SetVector<Value *> ValidSubtreeValues(ValidSubtreeValuesIt.begin(),
ValidSubtreeValuesIt.end());		ValidSubtreeValuesIt.end());

		bool AllowLibDevice = Arch == GPUArch::NVPTX64;
		[inline comment] bollu: `const bool`? :)

SetVector<Function *> ValidSubtreeFunctions(		SetVector<Function *> ValidSubtreeFunctions(
getFunctionsFromRawSubtreeValues(SubtreeValues));		getFunctionsFromRawSubtreeValues(SubtreeValues, AllowLibDevice));

// @see IslNodeBuilder::getReferencesInSubtree		// @see IslNodeBuilder::getReferencesInSubtree
SetVector<Value *> ReplacedValues;		SetVector<Value *> ReplacedValues;
for (Value *V : ValidSubtreeValues) {		for (Value *V : ValidSubtreeValues) {
auto It = ValueMap.find(V);		auto It = ValueMap.find(V);
if (It == ValueMap.end())		if (It == ValueMap.end())
ReplacedValues.insert(V);		ReplacedValues.insert(V);
else		else
[678 lines not shown]	if (TargetM->addPassesToEmitFile(
return "";		return "";
}		}

PM.run(*GPUModule);		PM.run(*GPUModule);

return ASMStream.str();		return ASMStream.str();
}		}

		void GPUNodeBuilder::addLibDevice() {
		if (Arch != GPUArch::NVPTX64)
		return;

		bool RequiresLibDevice = false;

		for (Function &F : GPUModule->functions()) {
		[inline comment] tra: !empty()
		if (!F.isDeclaration())
		continue;

		std::string LibDeviceFunc = getLibDeviceFuntion(&F);
		if (LibDeviceFunc.length() != 0) {
		F.setName(LibDeviceFunc);
		RequiresLibDevice = true;
		}
		}
		[inline comment] bollu: consider moving computing `RequiresLibDevice` to a pure function?
		[inline comment] grosser: Done.

		if (RequiresLibDevice) {
		[inline comment] tra: What's supposed to happen in this case? Consider adding a diagnostic message.
		SMDiagnostic Error;
		auto LibDeviceModule =
		parseIRFile(LibDevice, Error, GPUModule->getContext());

		if (!LibDeviceModule) {
		[inline comment] bollu: I believe we can `assert` at this point, since this is not a "mis-compile" in the strictest sense of the word?.
		[inline comment] grosser: I use report_fatal_error as suggested by Michael.
		[inline comment] tra: But why are you still printing the libdevice name on stderr?
		BuildSuccessful = false;
		errs() << "Could not find libdevice. Skipping GPU kernel generation. "
		"Please set -polly-acc-libdevice accordingly.\n";
		return;
		}

		Linker L(*GPUModule);

		// Set an nvptx64 target triple to avoid linker warnings. The original
		// trible of the libdevice files are nvptx-unknown-unknown.
		[inline comment] bollu: nit: `trible` -> `triple`.
		[inline comment] grosser: Done!
		LibDeviceModule->setTargetTriple(Triple::normalize("nvptx64-nvidia-cuda"));
		L.linkInModule(std::move(LibDeviceModule), Linker::LinkOnlyNeeded);
		}
		}

std::string GPUNodeBuilder::finalizeKernelFunction() {		std::string GPUNodeBuilder::finalizeKernelFunction() {
		[inline comment] tstellar: Could this be avoided by implementing Triple::isCompatibleWith() for nvptx?

if (verifyModule(*GPUModule)) {		if (verifyModule(*GPUModule)) {
DEBUG(dbgs() << "verifyModule failed on module:\n";		DEBUG(dbgs() << "verifyModule failed on module:\n";
GPUModule->print(dbgs(), nullptr); dbgs() << "\n";);		GPUModule->print(dbgs(), nullptr); dbgs() << "\n";);

if (FailOnVerifyModuleFailure)		if (FailOnVerifyModuleFailure)
llvm_unreachable("VerifyModule failed.");		llvm_unreachable("VerifyModule failed.");
		[inline comment] tra: I'm curious -- when would it be OK to proceed if verifier has failed? Should this option be other way around -- and make LLVM fail by default on verifier failure? That would be my expectation of normal behavior. One could conceivably use this option for debugging purposes to force LLVM to proceed even when verifier has failed, but in general I believe that by default the errors should be reported as early as possible.

BuildSuccessful = false;		BuildSuccessful = false;
return "";		return "";
}		}

		addLibDevice();

if (DumpKernelIR)		if (DumpKernelIR)
outs() << *GPUModule << "\n";		outs() << *GPUModule << "\n";

// Optimize module.		// Optimize module.
llvm::legacy::PassManager OptPasses;		llvm::legacy::PassManager OptPasses;
PassManagerBuilder PassBuilder;		PassManagerBuilder PassBuilder;
PassBuilder.OptLevel = 3;		PassBuilder.OptLevel = 3;
PassBuilder.SizeLevel = 0;		PassBuilder.SizeLevel = 0;
[835 lines not shown]	createSufficientComputeCheck(Scop &S, __isl_keep isl_ast_build *Build) {
return isl_ast_expr_ge(Iterations, MinComputeExpr);		return isl_ast_expr_ge(Iterations, MinComputeExpr);
}		}

/// Check if the basic block contains a function we cannot codegen for GPU		/// Check if the basic block contains a function we cannot codegen for GPU
/// kernels.		/// kernels.
///		///
/// If this basic block does something with a `Function` other than calling		/// If this basic block does something with a `Function` other than calling
/// a function that we support in a kernel, return true.		/// a function that we support in a kernel, return true.
bool containsInvalidKernelFunctionInBllock(const BasicBlock *BB) {		bool containsInvalidKernelFunctionInBllock(const BasicBlock *BB,
		[inline comment] tra: Bllock -> Block
		[inline comment] grosser: Fixed in r308715.
		bool AllowLibDevice) {
for (const Instruction &Inst : *BB) {		for (const Instruction &Inst : *BB) {
const CallInst *Call = dyn_cast<CallInst>(&Inst);		const CallInst *Call = dyn_cast<CallInst>(&Inst);
if (Call && isValidFunctionInKernel(Call->getCalledFunction())) {		if (Call &&
		isValidFunctionInKernel(Call->getCalledFunction(), AllowLibDevice)) {
continue;		continue;
}		}

for (Value *SrcVal : Inst.operands()) {		for (Value *SrcVal : Inst.operands()) {
PointerType *p = dyn_cast<PointerType>(SrcVal->getType());		PointerType *p = dyn_cast<PointerType>(SrcVal->getType());
if (!p)		if (!p)
continue;		continue;
if (isa<FunctionType>(p->getElementType()))		if (isa<FunctionType>(p->getElementType()))
return true;		return true;
}		}
}		}
return false;		return false;
}		}

/// Return whether the Scop S uses functions in a way that we do not support.		/// Return whether the Scop S uses functions in a way that we do not support.
bool containsInvalidKernelFunction(const Scop &S) {		bool containsInvalidKernelFunction(const Scop &S, bool AllowLibDevice) {
for (auto &Stmt : S) {		for (auto &Stmt : S) {
if (Stmt.isBlockStmt()) {		if (Stmt.isBlockStmt()) {
if (containsInvalidKernelFunctionInBllock(Stmt.getBasicBlock()))		if (containsInvalidKernelFunctionInBllock(Stmt.getBasicBlock(),
		AllowLibDevice))
return true;		return true;
} else {		} else {
assert(Stmt.isRegionStmt() &&		assert(Stmt.isRegionStmt() &&
"Stmt was neither block nor region statement");		"Stmt was neither block nor region statement");
for (const BasicBlock *BB : Stmt.getRegion()->blocks())		for (const BasicBlock *BB : Stmt.getRegion()->blocks())
if (containsInvalidKernelFunctionInBllock(BB))		if (containsInvalidKernelFunctionInBllock(BB, AllowLibDevice))
return true;		return true;
}		}
}		}
return false;		return false;
}		}

/// Generate code for a given GPU AST described by @p Root.		/// Generate code for a given GPU AST described by @p Root.
///		///
[68 lines not shown]	bool runOnScop(Scop &CurrentScop) override {
DL = &S->getRegion().getEntry()->getModule()->getDataLayout();		DL = &S->getRegion().getEntry()->getModule()->getDataLayout();
RI = &getAnalysis<RegionInfoPass>().getRegionInfo();		RI = &getAnalysis<RegionInfoPass>().getRegionInfo();

// We currently do not support functions other than intrinsics inside		// We currently do not support functions other than intrinsics inside
// kernels, as code generation will need to offload function calls to the		// kernels, as code generation will need to offload function calls to the
// kernel. This may lead to a kernel trying to call a function on the host.		// kernel. This may lead to a kernel trying to call a function on the host.
// This also allows us to prevent codegen from trying to take the		// This also allows us to prevent codegen from trying to take the
// address of an intrinsic function to send to the kernel.		// address of an intrinsic function to send to the kernel.
if (containsInvalidKernelFunction(CurrentScop)) {		if (containsInvalidKernelFunction(CurrentScop,
		Architecture == GPUArch::NVPTX64)) {
DEBUG(		DEBUG(
dbgs()		dbgs()
<< "Scop contains function which cannot be materialised in a GPU "		<< "Scop contains function which cannot be materialised in a GPU "
"kernel. Bailing out.\n";);		"kernel. Bailing out.\n";);
return false;		return false;
}		}

auto PPCGScop = createPPCGScop();		auto PPCGScop = createPPCGScop();
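The `AllowLibDevice` flag threaded through the checks above can be pictured with the following minimal, standalone sketch. It is not the code from this revision: the helper names `getLibDeviceName` and `isValidKernelCallee` are hypothetical, and the exact set of supported names is an assumption. Only the `expf` to `__nv_expf` mapping is confirmed by the test in this diff.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical helper (not part of this revision): return the libdevice
// name for a host math function, or an empty string if the function is
// not in the supported subset. The subset listed here is an assumption;
// the revision summary only says "the exp* and cos functions".
std::string getLibDeviceName(const std::string &Name) {
  static const std::set<std::string> Supported = {"exp", "expf", "cos",
                                                  "cosf"};
  if (Supported.count(Name))
    return "__nv_" + Name; // e.g. expf -> __nv_expf, as in the test below
  return "";
}

// Hypothetical predicate mirroring the role of AllowLibDevice: intrinsics
// are always acceptable inside a kernel, other functions only when
// libdevice is available and provides a replacement.
bool isValidKernelCallee(const std::string &Name, bool IsIntrinsic,
                         bool AllowLibDevice) {
  if (IsIntrinsic)
    return true;
  return AllowLibDevice && !getLibDeviceName(Name).empty();
}
```

With this shape, a scop that calls `expf` would be rejected when libdevice is not allowed (the non-NVPTX64 case in `runOnScop`) and accepted otherwise.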

test/GPGPU/libdevice-functions-copied-into-kernel.ll

This file was added.

				; RUN: opt %loadPolly -analyze -polly-scops < %s \
				; RUN: -polly-acc-libdevice=%S/libdevice-functions-copied-into-kernel_libdevice.bc \
				; RUN: | FileCheck %s --check-prefix=SCOP
				; RUN: opt %loadPolly -analyze -polly-codegen-ppcg -polly-acc-dump-kernel-ir \
				; RUN: -polly-acc-libdevice=%S/libdevice-functions-copied-into-kernel_libdevice.bc \
				; RUN: < %s | FileCheck %s --check-prefix=KERNEL-IR
				; RUN: opt %loadPolly -S -polly-codegen-ppcg < %s \
				; RUN: -polly-acc-libdevice=%S/libdevice-functions-copied-into-kernel_libdevice.bc \
				; RUN: | FileCheck %s --check-prefix=HOST-IR

				; Test that we do recognise and codegen a kernel that has functions that can
				; be mapped to NVIDIA's libdevice

				; REQUIRES: pollyacc

				; Check that we model the kernel as a scop.
				; SCOP: Function: f
				; SCOP-NEXT: Region: %entry.split---%for.end

				; Check that the libdevice call is present in the kernel IR.
				; KERNEL-IR: %p_expf = tail call float @__nv_expf(float %A.arr.i.val_p_scalar_)

				; Check that kernel launch is generated in host IR.
				; The declare would not be generated unless a call to a kernel exists.
				; HOST-IR: declare void @polly_launchKernel(i8, i32, i32, i32, i32, i32, i8)


				; void f(float *A, float *B, int N) {
				; for(int i = 0; i < N; i++) {
				; float tmp0 = A[i];
				; float tmp1 = expf(tmp0);
				; B[i] = tmp1;
				; }
				; }

				target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"

				define void @f(float* %A, float* %B, i32 %N) {
				entry:
				br label %entry.split

				entry.split: ; preds = %entry
				%cmp1 = icmp sgt i32 %N, 0
				br i1 %cmp1, label %for.body.lr.ph, label %for.end

				for.body.lr.ph: ; preds = %entry.split
				br label %for.body

				for.body: ; preds = %for.body.lr.ph, %for.body
				%indvars.iv = phi i64 [ 0, %for.body.lr.ph ], [ %indvars.iv.next, %for.body ]
				%A.arr.i = getelementptr inbounds float, float* %A, i64 %indvars.iv
				%A.arr.i.val = load float, float* %A.arr.i, align 4
				; Call to a math function that should be mapped to libdevice in the kernel.
				%expf = tail call float @expf(float %A.arr.i.val)
				%B.arr.i = getelementptr inbounds float, float* %B, i64 %indvars.iv
				store float %expf, float* %B.arr.i, align 4

				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%wide.trip.count = zext i32 %N to i64
				%exitcond = icmp ne i64 %indvars.iv.next, %wide.trip.count
				br i1 %exitcond, label %for.body, label %for.cond.for.end_crit_edge

				for.cond.for.end_crit_edge: ; preds = %for.body
				br label %for.end

				for.end: ; preds = %for.cond.for.end_crit_edge, %entry.split
				ret void
				}

				tra: Can the test be reduced to just expf() call?
				grosser (author): Unlikely. This is a test for Polly-ACC, where we auto-offload to CUDA. For this we need at least some parallelism, which means some loop.
				tra: It may be worth reconsidering your approach and making this functionality generic to NVPTX so it can benefit all users of NVPTX back-end. The functionality is generic enough to benefit CUDA (and possibly OpenCL) which currently fail miserably if any of standard library functions sneak into IR.
				; Function Attrs: nounwind readnone
				declare float @expf(float) #0

				attributes #0 = { nounwind readnone }

test/GPGPU/libdevice-functions-copied-into-kernel_libdevice.bc

This file was added.

				target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
				target triple = "nvptx64-nvidia-cuda"

				define float @__nv_expf(float %a) {
				ret float %a
				}
				tra: This file should be under Inputs/ directory (see NVPTX tests for example) and have .ll extension.
				grosser (author): Very good idea. I adopted it.