This is an archive of the discontinued LLVM Phabricator instance.

[CUDA][HIP] Skip setting `externally_initialized` for static device variables.
ClosedPublic

Authored by hliao on May 29 2019, 8:43 AM.

Diff Detail

Repository
rC Clang

Event Timeline

hliao created this revision. May 29 2019, 8:43 AM
Herald added a project: Restricted Project. May 29 2019, 8:43 AM
Herald added a subscriber: cfe-commits.
yaxunl added a reviewer: tra. May 29 2019, 8:55 AM

yaxunl added a comment.

LGTM. The externally_initialized attribute causes some optimizations to be disabled. For static device variables it seems reasonable to remove the externally_initialized attribute.
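
For context, a hedged sketch of the difference being discussed; the variable, the kernel, and the exact IR lines below are illustrative assumptions, not taken from the patch or its tests:

static __device__ int x;  // internal linkage: host code cannot initialize it

__global__ void kernel(int *out) {
  *out = x;  // device-side reference keeps x alive
}

// Before this patch, device-side compilation emitted roughly:
//   @x = internal addrspace(1) externally_initialized global i32 0
// After it, the attribute is dropped for the static variable, which
// re-enables optimizations that rely on knowing the initial value:
//   @x = internal addrspace(1) global i32 0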

tra accepted this revision. May 29 2019, 10:11 AM
tra added inline comments.
clang/test/CodeGenCUDA/device-var-init.cu
Line 39 (On Diff #201938)

Please add a host-side check that the host-side shadow of the variable is still an undef.

This revision is now accepted and ready to land. May 29 2019, 10:11 AM

hliao added a comment.

Thanks, but that static __device__ variable won't have a shadow on the host side anymore.

This revision was automatically updated to reflect the committed changes.
tra added a comment. May 29 2019, 10:24 AM

Thanks, but that static __device__ variable won't have a shadow on the host side anymore.

Why not? Your change only affects whether externally_initialized is applied to the variable during device-side compilation. It does not change what happens on the host side.
AFAICT, it will still be generated on the host side, and host code should still be able to take its address.
NVCC also allows that: https://godbolt.org/z/t78RvM
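
For illustration, a minimal sketch of what tra describes; the variable and function names here are mine, not tra's (his actual code is behind the godbolt link):

#include <cuda_runtime.h>

static __device__ int counter;  // internal linkage, lives on the device

// Host code refers to 'counter' through its host-side shadow; runtime
// calls translate the shadow's address to the real device address.
void reset_counter() {
  int zero = 0;
  cudaMemcpyToSymbol(counter, &zero, sizeof(zero));
}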

tra added a comment. May 29 2019, 10:28 AM

Note for the future -- it would be great if we could finish discussing the patch before landing it.
I would still like to see the host-side test.

hliao added a comment.

In D62603#1521503, @tra wrote:

Thanks, but that static __device__ variable won't have a shadow on the host side anymore.

Why not? Your change only affects whether externally_initialized is applied to the variable during device-side compilation. It does not change what happens on the host side.
AFAICT, it will still be generated on the host side, and host code should still be able to take its address.
NVCC also allows that: https://godbolt.org/z/t78RvM

In D62603#1521507, @tra wrote:

Note for the future -- it would be great if we could finish discussing the patch before landing it.
I would still like to see the host-side test.

Sorry, I will follow that rule. Yes, the patch only changes the device side. On the host side, even though that variable is declared static as well, there is no reference to it, so clang simply skips emitting it.

hliao added a comment.

In D62603#1521503, @tra wrote:

Thanks, but that static __device__ variable won't have a shadow on the host side anymore.

Why not? Your change only affects whether externally_initialized is applied to the variable during device-side compilation. It does not change what happens on the host side.
AFAICT, it will still be generated on the host side, and host code should still be able to take its address.
NVCC also allows that: https://godbolt.org/z/t78RvM

BTW, the code posted there looks quite weird to me. How could the code make sense by returning a pointer to a device variable? Or a pointer to the shadow host variable?

tra added a comment. May 29 2019, 12:12 PM

NVCC also allows that: https://godbolt.org/z/t78RvM

BTW, the code posted there looks quite weird to me. How could the code make sense by returning a pointer to a device variable? Or a pointer to the shadow host variable?

Magic. :-)
A more practical example would be something like this:

__device__ int array[10];

__host__ void func() {
  cudaMemset(array, 0, sizeof(array));
}

cudaMemset is a host function, and it needs to use something that exists on the host side as its first argument.
In order to deal with this, the compiler:

  • creates an uninitialized int array[10] on the host side. This allows using sizeof(array) on the host side.
  • registers its address/size with the CUDA runtime. This allows passing the address of the host-side shadow array to various CUDA runtime routines. The runtime knows what it has on the device side and maps the shadow's address to the real device address. This way CUDA runtime functions can make static device-side data accessible without having to explicitly figure out its device-side address. (A sketch of this registration glue follows the list.)
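
A rough sketch of that registration step. __cudaRegisterVar is the internal hook that clang and nvcc emit calls to, but its exact prototype varies across CUDA versions, so treat everything below as approximate:

// Approximate prototype of the internal registration hook
// (illustration only; real declarations live in CUDA's crt headers).
extern "C" void __cudaRegisterVar(void **fatbinHandle, char *hostVar,
                                  char *deviceAddress, const char *deviceName,
                                  int ext, size_t size, int constant,
                                  int global);

int array[10];  // uninitialized host-side shadow of __device__ int array[10]

// Compiler-generated glue (conceptually): associate the shadow's address
// with the device-side symbol named "array".
static void register_globals(void **fatbinHandle) {
  __cudaRegisterVar(fatbinHandle, (char *)array, (char *)"array", "array",
                    /*ext=*/0, sizeof(array), /*constant=*/0, /*global=*/0);
}
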
hliao added a comment.

In D62603#1521788, @tra wrote:

NVCC also allows that: https://godbolt.org/z/t78RvM

BTW, the code posted there looks quite weird to me. How could the code make sense by returning a pointer to a device variable? Or a pointer to the shadow host variable?

Magic. :-)
A more practical example would be something like this:

__device__ int array[10];

__host__ void func() {
  cudaMemset(array, 0, sizeof(array));
}

cudaMemset is a host function, and it needs to use something that exists on the host side as its first argument.
In order to deal with this, the compiler:

  • creates an uninitialized int array[10] on the host side. This allows using sizeof(array) on the host side.
  • registers its address/size with the CUDA runtime. This allows passing the address of the host-side shadow array to various CUDA runtime routines. The runtime knows what it has on the device side and maps the shadow's address to the real device address. This way CUDA runtime functions can make static device-side data accessible without having to explicitly figure out its device-side address.

That assumes the variable is not declared static. That's also the motivation of this patch.

tra added a comment. May 29 2019, 12:29 PM

That assumes the variable is not declared static. That's also the motivation of this patch.

cppreference defines internal linkage as 'The name can be referred to from all scopes in the current translation unit.'
The current translation unit in the CUDA context gets a bit murky. On one hand, host and device code are compiled separately and may conceivably be considered separate TUs. On the other hand, the fact that we mix host and device code in the same source file implies tight coupling, and users do expect them to be treated as if all host and device code in the source file were in the same TU. E.g., you may have a kernel in an anonymous namespace, yet you do want to be able to launch it from the host side.

I think static __device__ globals would fall into the same category -- nominally they should not be visible outside of the device-side object file, but in practice we do need to make them visible from the host side of the same TU.
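
A minimal illustration of that coupling (my own example, not from the thread):

namespace {
__global__ void bump(int *p) { ++*p; }  // internal linkage
}

void launch(int *devPtr) {
  bump<<<1, 1>>>(devPtr);  // host code in the same TU still launches it
}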

hliao added a comment.

In D62603#1521832, @tra wrote:

That assumes the variable is not declared static. That's also the motivation of this patch.

cppreference defines internal linkage as 'The name can be referred to from all scopes in the current translation unit.'
The current translation unit in the CUDA context gets a bit murky. On one hand, host and device code are compiled separately and may conceivably be considered separate TUs. On the other hand, the fact that we mix host and device code in the same source file implies tight coupling, and users do expect them to be treated as if all host and device code in the source file were in the same TU. E.g., you may have a kernel in an anonymous namespace, yet you do want to be able to launch it from the host side.

I think static __device__ globals would fall into the same category -- nominally they should not be visible outside of the device-side object file, but in practice we do need to make them visible from the host side of the same TU.

That's true only if there's a reference on the host side. E.g., if I modify the foo function to be both __host__ and __device__, the host-side shadow could be generated (with an `undef` initializer, as expected).
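
A sketch of that scenario; this foo is a stand-in for the function in the godbolt example, not a copy of it:

static __device__ int x;

// Once the referencing function is compiled for the host as well, the
// host side sees a real use of x and emits a shadow for it, with an
// undef initializer as expected.
__host__ __device__ int *foo() {
  return &x;
}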

hliao added a comment.

In D62603#1521832, @tra wrote:

That assumes the variable is not declared static. That's also the motivation of this patch.

cppreference defines internal linkage as 'The name can be referred to from all scopes in the current translation unit.'
The current translation unit in the CUDA context gets a bit murky. On one hand, host and device code are compiled separately and may conceivably be considered separate TUs. On the other hand, the fact that we mix host and device code in the same source file implies tight coupling, and users do expect them to be treated as if all host and device code in the source file were in the same TU. E.g., you may have a kernel in an anonymous namespace, yet you do want to be able to launch it from the host side.

I think static __device__ globals would fall into the same category -- nominally they should not be visible outside of the device-side object file, but in practice we do need to make them visible from the host side of the same TU.

Are you sure nvcc supports accessing static __device__ variables in host code? That would be expensive to implement. Instead of looking up only the dynamic symbol tables, we would now need to look up the symbol tables for local symbols. We would also have to differentiate local symbols that have the same name. This also means the user cannot strip symbol tables.

tra added a comment. May 29 2019, 1:53 PM

I think static __device__ globals would fall into the same category -- nominally they should not be visible outside of the device-side object file, but in practice we do need to make them visible from the host side of the same TU.

Are you sure nvcc supports accessing static __device__ variables in host code? That would be expensive to implement.

Address (of the shadow, translatable to the device address) and size -- yes. Values -- no.

E.g. you can pass &array as a parameter to the kernel. Host-side code will use the shadow's address, but the device-side kernel will get the real device-side address, translated from the shadow address by the runtime.
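
A sketch of that pattern; here cudaGetSymbolAddress makes the shadow-to-device translation explicit rather than relying on the implicit translation described above:

#include <cuda_runtime.h>

__device__ int array[10];

__global__ void fill(int *a) { a[threadIdx.x] = threadIdx.x; }

void host_launch() {
  int *devAddr = nullptr;
  // 'array' names the host-side shadow; the runtime maps it to the real
  // device-side address of the __device__ variable.
  cudaGetSymbolAddress((void **)&devAddr, array);
  fill<<<1, 10>>>(devAddr);
}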

Instead of looking up only the dynamic symbol tables, we would now need to look up the symbol tables for local symbols. We would also have to differentiate local symbols that have the same name. This also means the user cannot strip symbol tables.

I'm not sure I understand what you're saying. The CUDA runtime and device-side object file management are a black box to me, so I don't know exactly how NVIDIA has implemented this on the device side, but the fact remains: the host must have some way to refer to (some) device-side entities, specifically kernels and global variables, whether they are nominally static or not.