This is an archive of the discontinued LLVM Phabricator instance.

[CUDA][HIP] Skip setting `externally_initialized` for static device variables.
ClosedPublic

Authored by hliao on May 29 2019, 8:43 AM.

Diff Detail

Repository
rC Clang

Event Timeline

hliao created this revision. May 29 2019, 8:43 AM
Herald added a project: Restricted Project. May 29 2019, 8:43 AM
Herald added a subscriber: cfe-commits.
yaxunl added a reviewer: tra. May 29 2019, 8:55 AM

yaxunl added a comment.

LGTM. The externally_initialized attribute causes some optimizations to be disabled. For static device variables it seems reasonable to remove the externally_initialized attribute.
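
For context, a hedged sketch of the difference being discussed; the variable, the kernel, and the exact IR lines below are illustrative assumptions, not taken from the patch or its tests:

static __device__ int x;  // internal linkage: host code cannot initialize it

__global__ void kernel(int *out) {
  *out = x;  // device-side reference keeps x alive
}

// Before this patch, device-side compilation emitted roughly:
//   @x = internal addrspace(1) externally_initialized global i32 0
// After it, the attribute is dropped for the static variable, which
// re-enables optimizations that rely on knowing the initial value:
//   @x = internal addrspace(1) global i32 0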

tra accepted this revision. May 29 2019, 10:11 AM
tra added inline comments.
clang/test/CodeGenCUDA/device-var-init.cu
Line 39 (On Diff #201938)

Please add a host-side check that the host-side shadow of the variable is still an undef.

This revision is now accepted and ready to land. May 29 2019, 10:11 AM

hliao added a comment.

Thanks, but that static __device__ variable won't have a shadow on the host side anymore.

This revision was automatically updated to reflect the committed changes.
tra added a comment. May 29 2019, 10:24 AM

Thanks, but that static __device__ variable won't have a shadow on the host side anymore.

Why not? Your change only affects whether externally_initialized is applied to the variable during device-side compilation. It does not change what happens on the host side.
AFAICT, it will still be generated on the host side, and host code should still be able to take its address.
NVCC also allows that: https://godbolt.org/z/t78RvM
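
For illustration, a minimal sketch of what tra describes; the variable and function names here are mine, not tra's (his actual code is behind the godbolt link):

#include <cuda_runtime.h>

static __device__ int counter;  // internal linkage, lives on the device

// Host code refers to 'counter' through its host-side shadow; runtime
// calls translate the shadow's address to the real device address.
void reset_counter() {
  int zero = 0;
  cudaMemcpyToSymbol(counter, &zero, sizeof(zero));
}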

tra added a comment. May 29 2019, 10:28 AM

Note for the future -- it would be great if we could finish discussing the patch before landing it.
I would still like to see the host-side test.

hliao added a comment.

In D62603#1521503, @tra wrote:

Thanks, but that static __device__ variable won't have a shadow on the host side anymore.

Why not? Your change only affects whether externally_initialized is applied to the variable during device-side compilation. It does not change what happens on the host side.
AFAICT, it will still be generated on the host side, and host code should still be able to take its address.
NVCC also allows that: https://godbolt.org/z/t78RvM

In D62603#1521507, @tra wrote:

Note for the future -- it would be great if we could finish discussing the patch before landing it.
I would still like to see the host-side test.

Sorry, I will follow that rule. Yes, the patch only changes the device side. On the host side, even though that variable is declared static as well, there is no reference to it, so clang simply skips emitting it.

hliao added a comment.

In D62603#1521503, @tra wrote:

Thanks, but that static __device__ variable won't have a shadow on the host side anymore.

Why not? Your change only affects whether externally_initialized is applied to the variable during device-side compilation. It does not change what happens on the host side.
AFAICT, it will still be generated on the host side, and host code should still be able to take its address.
NVCC also allows that: https://godbolt.org/z/t78RvM

BTW, the code posted there looks quite weird to me. How could the code make sense by returning a pointer to a device variable? Or a pointer to the shadow host variable?

tra added a comment. May 29 2019, 12:12 PM

NVCC also allows that: https://godbolt.org/z/t78RvM

BTW, the code posted there looks quite weird to me. How could the code make sense by returning a pointer to a device variable? Or a pointer to the shadow host variable?

Magic. :-)
A more practical example would be something like this:

__device__ int array[10];

__host__ void func() {
  cudaMemset(array, 0, sizeof(array));
}

cudaMemset is a host function, and it needs to use something that exists on the host side as its first argument.
In order to deal with this, the compiler:

  • creates an uninitialized int array[10] on the host side. This allows using sizeof(array) on the host side.
  • registers its address/size with the CUDA runtime. This allows passing the address of the host-side shadow array to various CUDA runtime routines. The runtime knows what it has on the device side and maps the shadow's address to the real device address. This way CUDA runtime functions can make static device-side data accessible without having to explicitly figure out its device-side address. (A sketch of this registration glue follows the list.)
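
A rough sketch of that registration step. __cudaRegisterVar is the internal hook that clang and nvcc emit calls to, but its exact prototype varies across CUDA versions, so treat everything below as approximate:

// Approximate prototype of the internal registration hook
// (illustration only; real declarations live in CUDA's crt headers).
extern "C" void __cudaRegisterVar(void **fatbinHandle, char *hostVar,
                                  char *deviceAddress, const char *deviceName,
                                  int ext, size_t size, int constant,
                                  int global);

int array[10];  // uninitialized host-side shadow of __device__ int array[10]

// Compiler-generated glue (conceptually): associate the shadow's address
// with the device-side symbol named "array".
static void register_globals(void **fatbinHandle) {
  __cudaRegisterVar(fatbinHandle, (char *)array, (char *)"array", "array",
                    /*ext=*/0, sizeof(array), /*constant=*/0, /*global=*/0);
}
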
hliao added a comment.

In D62603#1521788, @tra wrote:

NVCC also allows that: https://godbolt.org/z/t78RvM

BTW, the code posted there looks quite weird to me. How could the code make sense by returning a pointer to a device variable? Or a pointer to the shadow host variable?

Magic. :-)
A more practical example would be something like this:

__device__ int array[10];

__host__ void func() {
  cudaMemset(array, 0, sizeof(array));
}

cudaMemset is a host function, and it needs to use something that exists on the host side as its first argument.
In order to deal with this, the compiler:

  • creates an uninitialized int array[10] on the host side. This allows using sizeof(array) on the host side.
  • registers its address/size with the CUDA runtime. This allows passing the address of the host-side shadow array to various CUDA runtime routines. The runtime knows what it has on the device side and maps the shadow's address to the real device address. This way CUDA runtime functions can make static device-side data accessible without having to explicitly figure out its device-side address.

That assumes the variable is not declared static. That's also the motivation of this patch.

tra added a comment. May 29 2019, 12:29 PM

That assumes the variable is not declared static. That's also the motivation of this patch.

cppreference defines internal linkage as 'The name can be referred to from all scopes in the current translation unit.'
The current translation unit in the CUDA context gets a bit murky. On one hand, host and device code are compiled separately and may conceivably be considered separate TUs. On the other hand, the fact that we mix host and device code in the same source file implies tight coupling, and users do expect them to be treated as if all host and device code in the source file were in the same TU. E.g., you may have a kernel in an anonymous namespace, yet you do want to be able to launch it from the host side.

I think static __device__ globals would fall into the same category -- nominally they should not be visible outside of the device-side object file, but in practice we do need to make them visible from the host side of the same TU.
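
A minimal illustration of that coupling (my own example, not from the thread):

namespace {
__global__ void bump(int *p) { ++*p; }  // internal linkage
}

void launch(int *devPtr) {
  bump<<<1, 1>>>(devPtr);  // host code in the same TU still launches it
}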

hliao added a comment.

In D62603#1521832, @tra wrote:

That assumes the variable is not declared static. That's also the motivation of this patch.

cppreference defines internal linkage as 'The name can be referred to from all scopes in the current translation unit.'
The current translation unit in the CUDA context gets a bit murky. On one hand, host and device code are compiled separately and may conceivably be considered separate TUs. On the other hand, the fact that we mix host and device code in the same source file implies tight coupling, and users do expect them to be treated as if all host and device code in the source file were in the same TU. E.g., you may have a kernel in an anonymous namespace, yet you do want to be able to launch it from the host side.

I think static __device__ globals would fall into the same category -- nominally they should not be visible outside of the device-side object file, but in practice we do need to make them visible from the host side of the same TU.

That's true only if there's a reference on the host side. E.g., if I modify the foo function to be both __host__ and __device__, the host-side shadow could be generated (with an `undef` initializer, as expected).
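
A sketch of that scenario; this foo is a stand-in for the function in the godbolt example, not a copy of it:

static __device__ int x;

// Once the referencing function is compiled for the host as well, the
// host side sees a real use of x and emits a shadow for it, with an
// undef initializer as expected.
__host__ __device__ int *foo() {
  return &x;
}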

hliao added a comment.

In D62603#1521832, @tra wrote:

That assumes the variable is not declared static. That's also the motivation of this patch.

cppreference defines internal linkage as 'The name can be referred to from all scopes in the current translation unit.'
The current translation unit in the CUDA context gets a bit murky. On one hand, host and device code are compiled separately and may conceivably be considered separate TUs. On the other hand, the fact that we mix host and device code in the same source file implies tight coupling, and users do expect them to be treated as if all host and device code in the source file were in the same TU. E.g., you may have a kernel in an anonymous namespace, yet you do want to be able to launch it from the host side.

I think static __device__ globals would fall into the same category -- nominally they should not be visible outside of the device-side object file, but in practice we do need to make them visible from the host side of the same TU.

Are you sure nvcc supports accessing static __device__ variables in host code? That would be expensive to implement. Instead of looking up only the dynamic symbol tables, we would now need to look up the symbol tables for local symbols. We would also have to differentiate local symbols that have the same name. This also means the user cannot strip symbol tables.

tra added a comment. May 29 2019, 1:53 PM

I think static __device__ globals would fall into the same category -- nominally they should not be visible outside of the device-side object file, but in practice we do need to make them visible from the host side of the same TU.

Are you sure nvcc supports accessing static __device__ variables in host code? That would be expensive to implement.

Address (of the shadow, translatable to the device address) and size -- yes. Values -- no.

E.g. you can pass &array as a parameter to the kernel. Host-side code will use the shadow's address, but the device-side kernel will get the real device-side address, translated from the shadow address by the runtime.
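
A sketch of that pattern; here cudaGetSymbolAddress makes the shadow-to-device translation explicit rather than relying on the implicit translation described above:

#include <cuda_runtime.h>

__device__ int array[10];

__global__ void fill(int *a) { a[threadIdx.x] = threadIdx.x; }

void host_launch() {
  int *devAddr = nullptr;
  // 'array' names the host-side shadow; the runtime maps it to the real
  // device-side address of the __device__ variable.
  cudaGetSymbolAddress((void **)&devAddr, array);
  fill<<<1, 10>>>(devAddr);
}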

Instead of looking up only the dynamic symbol tables, we would now need to look up the symbol tables for local symbols. We would also have to differentiate local symbols that have the same name. This also means the user cannot strip symbol tables.

I'm not sure I understand what you're saying. The CUDA runtime and device-side object file management are a black box to me, so I don't know exactly how NVIDIA has implemented this on the device side, but the fact remains: the host must have some way to refer to (some) device-side entities, specifically kernels and global variables, whether they are nominally static or not.