This is an archive of the discontinued LLVM Phabricator instance.

[CUDA] Don't call __cudaRegisterVariable on C++17 inline variables
ClosedPublic

Authored by MaskRay on Oct 3 2020, 11:51 AM.

Details

Summary

D17779: host-side shadow variables of external declarations of device-side
global variables have internal linkage and are referenced by
__cuda_register_globals.

nvcc from CUDA 11 does not allow __device__ inline or __device__ constexpr
(C++17 inline variables) but clang has incorrectly supported them for a while:

error: A __device__ variable cannot be marked constexpr
error: An inline __device__/__constant__/__managed__ variable must have internal linkage when the program is compiled in whole program mode (-rdc=false)

If such a variable (which has a comdat group) is discarded (a copy from another
translation unit is prevailing and selected), accessing the variable from
outside the section group (__cuda_register_globals) is a violation of the ELF
specification and will be rejected by linkers:

A symbol table entry with STB_LOCAL binding that is defined relative to one of a group's sections, and that is contained in a symbol table section that is not part of the group, must be discarded if the group members are discarded. References to this symbol table entry from outside the group are not allowed.

As a workaround, don't register such inline variables for now.
(If we register the variables in all TUs, we will keep multiple instances of the shadow and break the C++ semantics for inline variables).
We should reject such variables in Sema but our internal users need some time to migrate.

Diff Detail

Event Timeline

MaskRay created this revision.Oct 3 2020, 11:51 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 3 2020, 11:51 AM
MaskRay requested review of this revision.Oct 3 2020, 11:51 AM
MaskRay added a comment.EditedOct 3 2020, 1:58 PM

Maybe we should disallow it instead. nvcc from CUDA 11.1 does not allow __device__ inline or __device__ constexpr

edit: I can't. We have users of __device__ constexpr variables. We need time for them to migrate away from the nvcc unsupported feature.

MaskRay updated this revision to Diff 296007.Oct 3 2020, 2:51 PM
MaskRay retitled this revision from [CUDA] Suppress comdat on host-side shadow variables registered by __cuda_register_globals to [CUDA] Don't call __cudaRegisterVariable on C++17 inline variables.
MaskRay edited the summary of this revision. (Show Details)

Reject isInline instead

MaskRay edited the summary of this revision. (Show Details)Oct 3 2020, 2:53 PM
MaskRay edited the summary of this revision. (Show Details)
tra added a comment.Oct 5 2020, 11:23 AM

If such a variable (which has a comdat group) is discarded (a copy from another
translation unit is prevailing and selected), accessing the variable from
outside the section group (__cuda_register_globals) is a violation of the ELF
specification and will be rejected by linkers:

Every TU is the whole program on the GPU side, provided we compile w/o -frdc, so there's no other TU to prevail.
I don't have a good idea yet what's the best way to handle this in CUDA, but not registering the variables will likely to create other issues, only visible at runtime. E.g. some host-side code will attempt to use cudaMemcpy() on the symbol and will fail, because it's not been registered, even though we do have all other glue in place.

Could you provide an example where this is causing an issue?

In D88786#2312329, @tra wrote:

If such a variable (which has a comdat group) is discarded (a copy from another
translation unit is prevailing and selected), accessing the variable from
outside the section group (__cuda_register_globals) is a violation of the ELF
specification and will be rejected by linkers:

Every TU is the whole program on the GPU side, provided we compile w/o -frdc, so there's no other TU to prevail.
I don't have a good idea yet what's the best way to handle this in CUDA, but not registering the variables will likely to create other issues, only visible at runtime. E.g. some host-side code will attempt to use cudaMemcpy() on the symbol and will fail, because it's not been registered, even though we do have all other glue in place.

Could you provide an example where this is causing an issue?

If the C++17 inline variable appears in two TUs. They have the same comdat group. The first comdat group is prevailing and the second one is disarded. __cudaRegisterVar(...) in the second TU references a local symbol in a discarded section.

The previous revision (https://reviews.llvm.org/D88786?id=295997 ) drops the comdat, but I think it is inferior to this one.

tra added a comment.Oct 5 2020, 12:22 PM

Could you provide an example where this is causing an issue?

If the C++17 inline variable appears in two TUs. They have the same comdat group. The first comdat group is prevailing and the second one is disarded. __cudaRegisterVar(...) in the second TU references a local symbol in a discarded section.

So, if I understand you correctly, it's the *host* side which ends up dropping it in one of TUs. It is a bit of a problem, considering that both of those TUs will need their own register call for their own GPU-side counterpart of the variable.

a.h:
  __device__ inline int foo;
a1.cu: #inlcude "a.h"
  a1.o/host : inline int foo; // 'shadow' variable. 
              register(foo, gpu-side-foo) // tell runtime that when we use host-side foo we want to access device-side foo.
  a1/GPU: int foo; // the only device-side instance. It's always there.
a2.cu: #inlcude "a.h"
  a2.o/host : inline int foo; // 'shadow' variable. 
              register(foo, gpu-side-foo) // tell runtime that when we use host-side foo we want to access device-side foo.
  a2/GPU: int foo; // the only device-side instance. It's always there.

host_exe: a1.o, a2.o
  only one instance of inline int foo survives and we lose ability to tell which GPU-side `foo` we want to access when we use host-side foo shadow.

Not allowing inline/constexpr variables seems to be the only choice here. Otherwise we's have to keep multiple instances of the shadow and that would break the C++ semantics for inline and constexpr

The previous revision (https://reviews.llvm.org/D88786?id=295997 ) drops the comdat, but I think it is inferior to this one.

Silently dropping variable registration shifts the problem from link time to runtime. It may be OK as a temporary workaround for the build issues and only fraction of those will run into it at runtime, so it's technically an improvement, but we will need to catch it in Sema ASAP.

tra accepted this revision.Oct 5 2020, 12:22 PM
This revision is now accepted and ready to land.Oct 5 2020, 12:22 PM
MaskRay edited the summary of this revision. (Show Details)Oct 5 2020, 12:52 PM
This revision was landed with ongoing or failed builds.Oct 5 2020, 12:55 PM
This revision was automatically updated to reflect the committed changes.
yaxunl added a comment.Oct 5 2020, 2:43 PM

This patch may break some existing HIP applications.

For rdc mode, device vars are merged. Host shadow vars should also be in comdat and merged. HIP runtime just ignores the same shadow var registered with the same device var, everything should work.

For nordc mode, device vars are in different fat binaries. If shadow vars are not in comdat and not merged, they can be registered with device vars in different fat binaries. Things would still work.

I think inline variable and static constexpr member are very useful features. Disabling them for device variable is a big limitation.

tra added inline comments.Oct 7 2020, 1:17 PM
clang/lib/CodeGen/CodeGenModule.cpp
4137

For rdc mode, device vars are merged. Host shadow vars should also be in comdat and merged.

Right. I think we need to add || (getLangOpts().HIP && getLangOpts().GPURelocatableDeviceCode). Maybe even for both CUDA and HIP as rdc should work similarly in CUDA, too.

MaskRay marked an inline comment as done.Oct 7 2020, 2:28 PM
MaskRay added inline comments.
clang/lib/CodeGen/CodeGenModule.cpp
4137

I don't know -rdc=true. Hope @tra and @yaxunl can make the change with a description.

I confirm that __device__ inline int works under nvcc with -rdc=true but I don't know its implication with __cudaRegisterVariable.
constexpr is still forbidden.