- By declaring a device variable as static, we assume it won't be addressable from the host side, so externally_initialized is not required.
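A minimal sketch of the distinction the patch draws (variable names are illustrative, not from the patch):

```
// Device-side compilation, after this patch:
static __device__ int internal_var = 1; // internal linkage: emitted without
                                        // externally_initialized
__device__ int visible_var = 2;         // still emitted with externally_initialized,
                                        // since the host may set its value before launch
```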
Details
Diff Detail
- Repository: rC Clang
Event Timeline
LGTM. The externally_initialized attribute causes some optimizations to be disabled. For static device variables it seems reasonable to remove the externally_initialized attribute.
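For context, a hypothetical example of the kind of optimization this unblocks (names are made up):

```
// Hypothetical example; 'scale' is never written on the device side.
static __device__ int scale = 4;

__global__ void mul(int *out, int x) {
  // Without externally_initialized, the device-side optimizer may fold this load
  // to the constant 4. With the attribute it must keep the load, since the host
  // could overwrite the value (e.g. via cudaMemcpyToSymbol) before a launch.
  *out = x * scale;
}
```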
clang/test/CodeGenCUDA/device-var-init.cu:39 ↗ (On Diff #201938)
Please add a host-side check that the host-side shadow of the variable is still an undef.
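Roughly the kind of check being asked for (an illustrative sketch only; the real test's variable names and CHECK prefixes differ):

```
// Host-side compilation: the shadow of a static __device__ variable should
// still be emitted, with an undef initializer.
static __device__ int dv;
// HOST: @_ZL2dv = internal global i32 undef
```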
Why not? Your change only changes whether externally_initialized is applied to the variable during device-side compilation. It does not change what happens on the host side.
AFAICT, it will still be generated on the host side and the host side should still be able to take its address.
NVCC also allows that: https://godbolt.org/z/t78RvM
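Presumably the linked example does something along these lines (a sketch; the actual godbolt code may differ):

```
static __device__ int x;

// Host function returning the address of the static __device__ variable.
// What it actually returns is the address of the host-side shadow, which is
// only meaningful when handed back to CUDA runtime APIs.
int *addr() { return &x; }
```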
Note for the future -- it would be great if we could finish discussing the patch before landing it.
I would still like to see the host-side test.
Sorry, I will follow that rule. Yes, this patch only changes the device side. On the host side, even though the variable is declared static as well, there is no reference to it, so clang just skips emitting it.
BTW, the code posted looks quite weird to me. How could the code make sense by returning a pointer to a device variable? Or a pointer to the shadow host variable?
> NVCC also allows that: https://godbolt.org/z/t78RvM

> BTW, the code posted looks quite weird to me. How could the code make sense by returning a pointer to a device variable? Or a pointer to the shadow host variable?
Magic. :-)
A more practical example would be something like this:
__device__ int array[10]; __host__ void func() { cudaMemset(array, 0, sizeof(array)); }
cudaMemset is a host function and needs something that exists on the host side as its first argument.
In order to deal with this, the compiler:
- creates an uninitialized int array[10] on the host side. This allows using sizeof(array) on the host side.
- registers its address/size with the CUDA runtime. This allows passing the address of the host-side shadow array to various CUDA runtime routines. The runtime knows what it has on the device side and maps the shadow's address to the real device address. This way CUDA runtime functions can make static device-side data accessible without the user having to explicitly figure out their device-side addresses.
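A hedged sketch of that machinery from the user's point of view, following the cudaMemset example above (standard CUDA runtime calls; error handling omitted):

```
#include <cuda_runtime.h>

__device__ int array[10];   // real storage on the device; the compiler also emits
                            // an uninitialized host-side shadow and registers it

void host_func() {
  // sizeof(array) works on the host because the shadow has the same type and size.
  cudaMemset(array, 0, sizeof(array));

  // The runtime can translate the shadow to the real device-side address...
  void *dev_addr = nullptr;
  cudaGetSymbolAddress(&dev_addr, array);

  // ...and move data in and out through the symbol APIs.
  int host_copy[10] = {};
  cudaMemcpyToSymbol(array, host_copy, sizeof(host_copy));
}
```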
That assumes the variable is not declared static; that's also the motivation of this patch.
cppreference defines internal linkage as 'The name can be referred to from all scopes in the current translation unit.'
What counts as the current translation unit gets a bit murky in the CUDA context. On one hand, host and device are compiled separately and may conceivably be considered separate TUs. On the other hand, the fact that we mix host and device code in the same source file implies tight coupling, and users do expect them to be treated as if all host and device code in the source file were in the same TU. E.g. you may have a kernel in an anonymous namespace yet still want to be able to launch it from the host side.
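E.g., a sketch of the anonymous-namespace case:

```
namespace {
__global__ void kernel(int *out) { *out = 42; }  // internal linkage
}

void launch_from_host(int *dev_out) {
  kernel<<<1, 1>>>(dev_out);   // host code in the same file still expects to launch it
}
```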
I think static __device__ globals would fall into the same category -- nominally they should not be visible outside of the device-side object file, but in practice we do need to make them visible from the host side of the same TU.
That's true if there's a reference on the host side. E.g., if I modify the foo function to be both __host__ and __device__, the host-side shadow does get generated (with an `undef` initializer, as expected).
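A sketch of the modification being described (the actual test code is not shown in this thread; the variable name is illustrative):

```
static __device__ int x;

// Making foo __host__ __device__ introduces a host-side reference to x, so the
// host-side shadow of x is emitted as well (with an undef initializer).
__host__ __device__ int foo() { return x; }
```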
Are you sure nvcc supports accessing static __device__ variables in host code? That would be expensive to implement. Instead of looking up dynamic symbol tables only, we would now need to look up symbol tables for local symbols. We would also have to differentiate local symbols that have the same name. It also means users cannot strip symbol tables.
Address (of the shadow, translatable to device address) and size -- yes. Values -- no.
E.g. you can pass &array as a parameter to the kernel. Host-side code will use the shadow's address, but the device-side kernel will get the real device-side address, translated from the shadow address by the runtime.
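A sketch of that pattern, with the shadow-to-device translation spelled out explicitly via cudaGetSymbolAddress (the implicit translation described above is a property of the runtime and is not demonstrated here):

```
#include <cuda_runtime.h>

__device__ int array[10];

__global__ void fill(int *p) { p[threadIdx.x] = threadIdx.x; }

void host_launch() {
  // Host code names the shadow; the runtime maps it to the device-side address.
  void *dev_array = nullptr;
  cudaGetSymbolAddress(&dev_array, array);
  fill<<<1, 10>>>(static_cast<int *>(dev_array));
}
```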
> Instead of looking up dynamic symbol tables only, we would now need to look up symbol tables for local symbols. We would also have to differentiate local symbols that have the same name. It also means users cannot strip symbol tables.
I'm not sure I understand what you're saying. The CUDA runtime and device-side object file management are a black box to me, so I don't know how exactly NVIDIA has implemented this on the device side, but the fact remains: the host must have some way to refer to (some) device-side entities -- specifically, kernels and global variables, whether they are nominally static or not.