This is an archive of the discontinued LLVM Phabricator instance.

Differential D11993

[CUDA] Make sure we emit all templated __global__ functions on device side. Again.
Abandoned, Public

Authored by tra on Aug 12 2015, 2:30 PM.

Details

Reviewers

eliben
echristo
rsmith

Summary

This is a somewhat different way to do it than D11666 which got rolled back.

Codegen postpones emitting an instantiated kernel function template until it is used.
If the kernel is used only from the host side (which is normally the case), we never emit
it, because on the device side we don't emit the host code that uses it.

The change allows CUDA kernels to be emitted on device side unconditionally.
It's overly conservative and may emit more functions than we really need, but it
guarantees that the kernels launched from the host side exist on the device side.
In case it ever causes issues, there are other ways to address the issue,
though they are more invasive and are currently not worth the trouble.
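
The effect of the rule can be modelled outside of Clang. The following is a minimal sketch, not Clang's code: `ToyLangOpts`, `ToyFunction`, `toyMustBeEmitted`, and the `ForceKernels` switch are hypothetical names invented for illustration. It also assumes a simplified emission policy in which non-template definitions are always emitted and template instantiations are emitted only when already-emitted code references them.

```cpp
#include <string>

// Hypothetical stand-in for the two language options the patch consults.
struct ToyLangOpts {
  bool CUDA;
  bool CUDAIsDevice;
};

// Hypothetical stand-in for a function definition seen by codegen.
struct ToyFunction {
  std::string Name;
  bool IsKernel;                   // models the __global__ attribute
  bool IsTemplateInstance;         // instantiations are emitted lazily
  bool ReferencedFromEmittedCode;  // some emitted function uses it
};

// Returns true if the function is emitted in this toy model.
// ForceKernels toggles the rule proposed in this revision.
bool toyMustBeEmitted(const ToyLangOpts &Opts, const ToyFunction &F,
                      bool ForceKernels) {
  // Proposed rule: on the device side every kernel is required.
  if (ForceKernels && Opts.CUDA && Opts.CUDAIsDevice && F.IsKernel)
    return true;
  // Simplifying assumption: ordinary definitions are emitted eagerly.
  if (!F.IsTemplateInstance)
    return true;
  // Instantiations are deferred until emitted code references them.
  return F.ReferencedFromEmittedCode;
}
```

In this model, a kernel instantiation that only host code launches has `ReferencedFromEmittedCode == false` during device-side compilation, so it is dropped unless `ForceKernels` is set.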

Event Timeline

tra updated this revision to Diff 31973. Aug 12 2015, 2:30 PM

tra retitled this revision to "[CUDA] Make sure we emit all templated __global__ functions on device side. Again."

tra updated this object.

tra added reviewers: echristo, rsmith, eliben.

tra added a subscriber: cfe-commits.

LGTM. Thanks for working on this.

-eric

This revision is now accepted and ready to land. Aug 12 2015, 2:34 PM

lgtm

test/CodeGenCUDA/ptx-kernels.cu
Line 26 (eliben): It's not clear what this metadata is part of. I'm guessing llvm.used, so maybe make that explicit with an earlier CHECK?

Emitting IR is not sufficient to ensure that the kernels survive GDCE, so the patch does not work with optimizations on.
D11666 would have to do for now.
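
The failure mode mentioned here can be illustrated with a toy reachability pass. This is a simplified sketch, not LLVM's GlobalDCE implementation: `ToyGlobal`, `toyGlobalDCE`, and the `InLLVMUsed` flag are hypothetical names. The model assumes that a discardable definition (for example one with `linkonce_odr` linkage, as in the test below) is dropped unless it is reachable from a non-discardable root or is pinned the way membership in `@llvm.used` pins a symbol.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a function in a module.
struct ToyGlobal {
  std::string Name;
  bool Discardable;                  // models linkonce_odr-style linkage
  bool InLLVMUsed;                   // models membership in @llvm.used
  std::vector<std::size_t> Callees;  // indices of functions it references
};

// Returns the names of the functions that survive the toy pass.
// Roots are non-discardable functions and functions pinned as "used";
// everything reachable from a root is kept, the rest is dropped.
std::vector<std::string> toyGlobalDCE(const std::vector<ToyGlobal> &Module) {
  std::vector<bool> Alive(Module.size(), false);
  std::vector<std::size_t> Worklist;
  for (std::size_t I = 0; I < Module.size(); ++I) {
    if (!Module[I].Discardable || Module[I].InLLVMUsed) {
      Alive[I] = true;
      Worklist.push_back(I);
    }
  }
  while (!Worklist.empty()) {
    std::size_t Cur = Worklist.back();
    Worklist.pop_back();
    for (std::size_t Callee : Module[Cur].Callees) {
      if (Callee < Module.size() && !Alive[Callee]) {
        Alive[Callee] = true;
        Worklist.push_back(Callee);
      }
    }
  }
  std::vector<std::string> Survivors;
  for (std::size_t I = 0; I < Module.size(); ++I)
    if (Alive[I])
      Survivors.push_back(Module[I].Name);
  return Survivors;
}
```

Under these assumptions, a discardable kernel instantiation that nothing in the device module references is removed even though it was emitted as IR, which matches the observation that emission alone does not help once optimizations are enabled.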

Revision Contents

Path	Size
lib/AST/ASTContext.cpp	14 lines
test/CodeGenCUDA/ptx-kernels.cu	10 lines

Diff 31973

lib/AST/ASTContext.cpp

@@ bool ASTContext::DeclMustBeEmitted(const Decl *D) {
   if (const FunctionDecl *FD = dyn_cast<FunctionDecl>(D)) {
     // Forward declarations aren't required.
     if (!FD->doesThisDeclarationHaveABody())
       return FD->doesDeclarationForceExternallyVisibleDefinition();

     // Constructors and destructors are required.
     if (FD->hasAttr<ConstructorAttr>() || FD->hasAttr<DestructorAttr>())
       return true;

+    // Force all CUDA kernels to be emitted on device side.
+    // Otherwise, templated kernels may never be emitted as they are
+    // only used from host-side code which we never emit on device
+    // side and which therefore would never trigger us to emit
+    // device-side kernel it might've instantiated. The trade-off is
+    // that emitting all kernels is over-conservative and we may emit
+    // more of them than necessary. If excess of generated GPU code
+    // becomes a problem we can revisit this.
+    if (getLangOpts().CUDA && getLangOpts().CUDAIsDevice &&
+        FD->hasAttr<CUDAGlobalAttr>())
+      return true;
+
     // The key function for a class is required. This rule only comes
     // into play when inline functions can be key functions, though.
     if (getTargetInfo().getCXXABI().canKeyFunctionBeInline()) {
       if (const CXXMethodDecl *MD = dyn_cast<CXXMethodDecl>(FD)) {
         const CXXRecordDecl *RD = MD->getParent();
         if (MD->isOutOfLine() && RD->isDynamicClass()) {
           const CXXMethodDecl *KeyFunc = getCurrentKeyFunction(RD);
           if (KeyFunc && KeyFunc->getCanonicalDecl() == MD->getCanonicalDecl())

test/CodeGenCUDA/ptx-kernels.cu

+// Make sure that __global__ functions are emitted along with correct
+// annotations and are added to @llvm.used to prevent their elimination.
+// REQUIRES: nvptx-registered-target
+//
 // RUN: %clang_cc1 %s -triple nvptx-unknown-unknown -fcuda-is-device -emit-llvm -o - | FileCheck %s

 #include "Inputs/cuda.h"

 // CHECK-LABEL: define void @device_function
 extern "C"
 __device__ void device_function() {}

 // CHECK-LABEL: define void @global_function
 extern "C"
 __global__ void global_function() {
   // CHECK: call void @device_function
   device_function();
 }

+// Make sure host-instantiated kernels are preserved on device side.
+template <typename T> __global__ void templated_kernel(T param) {}
+// CHECK-LABEL: define linkonce_odr void @_Z16templated_kernelIiEvT_
+void host_function() { templated_kernel<<<0,0>>>(0); }
+
 // CHECK: !{{[0-9]+}} = !{void ()* @global_function, !"kernel", i32 1}
+// CHECK: !{{[0-9]+}} = !{void (i32)* @_Z16templated_kernelIiEvT_, !"kernel", i32 1}