This is an archive of the discontinued LLVM Phabricator instance.

Differential D88786

[CUDA] Don't call __cudaRegisterVariable on C++17 inline variables
Closed · Public

Authored by MaskRay on Oct 3 2020, 11:51 AM.

Details

Reviewers

jlebar
tra

Commits

rGa2cc8833683d: [CUDA] Don't call __cudaRegisterVariable on C++17 inline variables

Summary

D17779: host-side shadow variables of external declarations of device-side
global variables have internal linkage and are referenced by
__cuda_register_globals.

nvcc from CUDA 11 does not allow __device__ inline or __device__ constexpr
(C++17 inline variables) but clang has incorrectly supported them for a while:

error: A __device__ variable cannot be marked constexpr
error: An inline __device__/__constant__/__managed__ variable must have internal linkage when the program is compiled in whole program mode (-rdc=false)

If such a variable (which has a comdat group) is discarded (a copy from another
translation unit is prevailing and selected), accessing the variable from
outside the section group (__cuda_register_globals) is a violation of the ELF
specification and will be rejected by linkers:

A symbol table entry with STB_LOCAL binding that is defined relative to one of a group's sections, and that is contained in a symbol table section that is not part of the group, must be discarded if the group members are discarded. References to this symbol table entry from outside the group are not allowed.

As a workaround, don't register such inline variables for now.
(If we register the variables in all TUs, we will keep multiple instances of the shadow and break the C++ semantics for inline variables).
We should reject such variables in Sema but our internal users need some time to migrate.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

MaskRay created this revision. Oct 3 2020, 11:51 AM

Herald added a project: Restricted Project. Oct 3 2020, 11:51 AM

Herald added subscribers: cfe-commits, yaxunl.

MaskRay requested review of this revision. Oct 3 2020, 11:51 AM

Harbormaster completed remote builds in B73886: Diff 295997. Oct 3 2020, 12:04 PM

Maybe we should disallow it instead. nvcc from CUDA 11.1 does not allow __device__ inline or __device__ constexpr

edit: I can't. We have users of __device__ constexpr variables. We need time for them to migrate away from the nvcc unsupported feature.

Reject isInline instead

MaskRay edited the summary of this revision. Oct 3 2020, 2:53 PM

MaskRay edited the summary of this revision.

Harbormaster completed remote builds in B73894: Diff 296007. Oct 3 2020, 3:04 PM

If such a variable (which has a comdat group) is discarded (a copy from another
translation unit is prevailing and selected), accessing the variable from
outside the section group (__cuda_register_globals) is a violation of the ELF
specification and will be rejected by linkers:

Every TU is the whole program on the GPU side, provided we compile w/o -frdc, so there's no other TU to prevail.
I don't yet have a good idea of the best way to handle this in CUDA, but not registering the variables will likely create other issues that are only visible at runtime. E.g. some host-side code will attempt to use cudaMemcpy() on the symbol and will fail because it has not been registered, even though we do have all the other glue in place.

Could you provide an example where this is causing an issue?

In D88786#2312329, @tra wrote:

If such a variable (which has a comdat group) is discarded (a copy from another
translation unit is prevailing and selected), accessing the variable from
outside the section group (__cuda_register_globals) is a violation of the ELF
specification and will be rejected by linkers:

Every TU is the whole program on the GPU side, provided we compile w/o -frdc, so there's no other TU to prevail.
I don't yet have a good idea of the best way to handle this in CUDA, but not registering the variables will likely create other issues that are only visible at runtime. E.g. some host-side code will attempt to use cudaMemcpy() on the symbol and will fail because it has not been registered, even though we do have all the other glue in place.

Could you provide an example where this is causing an issue?

Suppose the C++17 inline variable appears in two TUs. Both copies have the same comdat group. The first comdat group prevails and the second one is discarded, so __cudaRegisterVar(...) in the second TU references a local symbol in a discarded section.

The previous revision (https://reviews.llvm.org/D88786?id=295997 ) drops the comdat, but I think it is inferior to this one.
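
The two-TU scenario above can be sketched as a toy model. This is only an illustration of the comdat rule quoted in the summary, not how any real linker is implemented; the names `LocalRef`, `pickPrevailing`, and `referenceIsValid` are hypothetical and exist only for this example.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy model only: real linkers operate on ELF sections and relocations.
struct LocalRef {
  int tu;            // translation unit that contains the reference
  std::string group; // comdat group defining the referenced local symbol
};

// Keep the first copy of every comdat group; later copies are discarded.
// Returns, for each group name, the TU whose copy prevails.
std::map<std::string, int>
pickPrevailing(const std::vector<std::pair<int, std::string>> &groups) {
  std::map<std::string, int> prevailing;
  for (const auto &entry : groups)
    prevailing.emplace(entry.second, entry.first); // no-op if already present
  return prevailing;
}

// A reference from outside the group to a local symbol inside it is only
// acceptable when that TU's own copy of the group was kept.
bool referenceIsValid(const LocalRef &ref,
                      const std::map<std::string, int> &prevailing) {
  auto it = prevailing.find(ref.group);
  return it != prevailing.end() && it->second == ref.tu;
}
```

With groups `{1, "inline_var"}` and `{2, "inline_var"}`, the reference from TU 1 is valid and the reference from TU 2 is not, which corresponds to the registration call in the second TU pointing at a discarded local symbol.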

In D88786#2312365, @MaskRay wrote:

Could you provide an example where this is causing an issue?

Suppose the C++17 inline variable appears in two TUs. Both copies have the same comdat group. The first comdat group prevails and the second one is discarded, so __cudaRegisterVar(...) in the second TU references a local symbol in a discarded section.

So, if I understand you correctly, it's the *host* side which ends up dropping it in one of the TUs. That is a bit of a problem, considering that both of those TUs will need their own register call for their own GPU-side counterpart of the variable.

a.h:
  __device__ inline int foo;
a1.cu: #include "a.h"
  a1.o/host : inline int foo; // 'shadow' variable.
              register(foo, gpu-side-foo) // tell runtime that when we use host-side foo we want to access device-side foo.
  a1/GPU: int foo; // the only device-side instance. It's always there.
a2.cu: #include "a.h"
  a2.o/host : inline int foo; // 'shadow' variable.
              register(foo, gpu-side-foo) // tell runtime that when we use host-side foo we want to access device-side foo.
  a2/GPU: int foo; // the only device-side instance. It's always there.

host_exe: a1.o, a2.o
  only one instance of inline int foo survives and we lose the ability to tell which GPU-side `foo` we want to access when we use the host-side foo shadow.

Not allowing inline/constexpr variables seems to be the only choice here. Otherwise we'd have to keep multiple instances of the shadow, and that would break the C++ semantics for inline and constexpr.

The previous revision (https://reviews.llvm.org/D88786?id=295997 ) drops the comdat, but I think it is inferior to this one.

Silently dropping variable registration shifts the problem from link time to runtime. It may be OK as a temporary workaround for the build issues, and only a fraction of those will run into it at runtime, so it's technically an improvement, but we will need to catch it in Sema ASAP.
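
Both failure modes discussed in this comment can be illustrated with a toy lookup table. This is a sketch under stated assumptions: `ShadowTable` is a hypothetical stand-in and is not the CUDA runtime API, and the assumption that the runtime keys its bookkeeping on the address of the host-side shadow is made only for the sake of the example.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for the runtime's bookkeeping; not the CUDA API.
// Key: address of the host-side shadow variable.
// Value: a label for the device-side variable it should resolve to.
class ShadowTable {
public:
  // Returns false when the shadow is already bound to a different target,
  // i.e. the table cannot tell which device-side copy was meant.
  bool registerVar(const void *shadow, const std::string &deviceVar) {
    auto result = table_.emplace(shadow, deviceVar);
    return result.second || result.first->second == deviceVar;
  }

  // Returns nullptr for a shadow that was never registered, modelling a
  // host-side access that can only fail at runtime.
  const std::string *lookup(const void *shadow) const {
    auto it = table_.find(shadow);
    return it == table_.end() ? nullptr : &it->second;
  }

private:
  std::map<const void *, std::string> table_;
};
```

If the single surviving host shadow is registered from both a1.o and a2.o, the second `registerVar` call conflicts with the first. If registration is skipped entirely, `lookup` returns `nullptr`, which corresponds to the link-time problem turning into a runtime one.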

tra accepted this revision. Oct 5 2020, 12:22 PM

This revision is now accepted and ready to land. Oct 5 2020, 12:22 PM

MaskRay edited the summary of this revision. Oct 5 2020, 12:52 PM

This revision was landed with ongoing or failed builds. Oct 5 2020, 12:55 PM

Closed by commit rGa2cc8833683d: [CUDA] Don't call __cudaRegisterVariable on C++17 inline variables (authored by MaskRay).

This revision was automatically updated to reflect the committed changes.

MaskRay added a commit: rGa2cc8833683d: [CUDA] Don't call __cudaRegisterVariable on C++17 inline variables.

This patch may break some existing HIP applications.

For rdc mode, device vars are merged. Host shadow vars should also be in comdat and merged. The HIP runtime just ignores the same shadow var registered with the same device var, so everything should work.

For nordc mode, device vars are in different fat binaries. If shadow vars are not in comdat and not merged, they can be registered with device vars in different fat binaries. Things would still work.

I think inline variables and static constexpr members are very useful features. Disabling them for device variables is a big limitation.

tra added inline comments. Oct 7 2020, 1:17 PM

clang/lib/CodeGen/CodeGenModule.cpp
Line 4137: "For rdc mode, device vars are merged. Host shadow vars should also be in comdat and merged."
Right. I think we need to add `|| (getLangOpts().HIP && getLangOpts().GPURelocatableDeviceCode)`. Maybe even for both CUDA and HIP, as `rdc` should work similarly in CUDA, too.

MaskRay marked an inline comment as done. Oct 7 2020, 2:28 PM

MaskRay added inline comments.

clang/lib/CodeGen/CodeGenModule.cpp
Line 4137: I don't know `-rdc=true`. I hope @tra and @yaxunl can make the change with a description. I confirm that `__device__ inline int` works under nvcc with -rdc=true, but I don't know its implications for `__cudaRegisterVariable`. constexpr is still forbidden.
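
To make the two conditions under discussion concrete, here is a small sketch that models them as plain boolean predicates. `VarInfo`, `Options`, and both function names are hypothetical; in clang the inputs come from the declaration and the language options. The placement of the extra `||` clause is only one possible reading of the inline comment, since the thread does not spell out the final expression.

```cpp
#include <cassert>

// Hypothetical flattened inputs for illustration only.
struct VarInfo {
  bool hasExternalStorage;
  bool isInline;
};
struct Options {
  bool hip;
  bool rdc; // stands in for GPURelocatableDeviceCode
};

// Condition as landed in this revision: skip extern declarations and
// C++17 inline variables.
bool registersAsLanded(const VarInfo &v) {
  return !v.hasExternalStorage && !v.isInline;
}

// Assumed reading of the suggested follow-up: inline variables are
// registered again when compiling HIP with relocatable device code.
bool registersWithRdcException(const VarInfo &v, const Options &o) {
  return !v.hasExternalStorage && (!v.isInline || (o.hip && o.rdc));
}
```

Under this reading, an inline device variable is registered only for HIP with rdc enabled, while non-inline definitions behave exactly as before the patch.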

Revision Contents

Path	Size
clang/lib/CodeGen/CodeGenModule.cpp	7 lines
clang/test/CodeGenCUDA/device-stub.cu	20 lines

Diff 296270

clang/lib/CodeGen/CodeGenModule.cpp

[4,123 lines not shown; enclosing context: if (LangOpts.CUDAIsDevice) {]
       // global variables become internal definitions. These have to
       // be internal in order to prevent name conflicts with global
       // host variables with the same name in a different TUs.
       if (D->hasAttr<CUDADeviceAttr>() || D->hasAttr<CUDAConstantAttr>()) {
         Linkage = llvm::GlobalValue::InternalLinkage;
         // Shadow variables and their properties must be registered with CUDA
         // runtime. Skip Extern global variables, which will be registered in
         // the TU where they are defined.
-        if (!D->hasExternalStorage())
+        //
+        // Don't register a C++17 inline variable. The local symbol can be
+        // discarded and referencing a discarded local symbol from outside the
+        // comdat (__cuda_register_globals) is disallowed by the ELF spec.
+        // TODO: Reject __device__ constexpr and __device__ inline in Sema.
+        if (!D->hasExternalStorage() && !D->isInline())
           getCUDARuntime().registerDeviceVar(D, *GV, !D->hasDefinition(),
                                              D->hasAttr<CUDAConstantAttr>());
       } else if (D->hasAttr<CUDASharedAttr>()) {
         // __shared__ variables are odd. Shadows do get created, but
         // they are not registered with the CUDA runtime, so they
         // can't really be used to access their device-side
         // counterparts. It's not clear yet whether it's nvcc's bug or
         // a feature, but we've got to do the same for compatibility.
[1,994 lines not shown]

clang/test/CodeGenCUDA/device-stub.cu

[23 lines not shown]
 // RUN: %clang_cc1 -triple x86_64-linux-gnu -emit-llvm %s \
 // RUN: -target-sdk-version=9.2 -fcuda-include-gpubinary %t -o - -DNOGLOBALS \
 // RUN: | FileCheck -allow-deprecated-dag-overlap %s \
 // RUN: --check-prefixes=NOGLOBALS,CUDANOGLOBALS
 // RUN: %clang_cc1 -triple x86_64-linux-gnu -emit-llvm %s \
 // RUN: -target-sdk-version=9.2 -fgpu-rdc -fcuda-include-gpubinary %t -o - \
 // RUN: | FileCheck %s -allow-deprecated-dag-overlap \
 // RUN: --check-prefixes=ALL,LNX,RDC,CUDA,CUDARDC,CUDA_NEW
+// RUN: %clang_cc1 -triple x86_64-linux-gnu -emit-llvm %s -std=c++17 \
+// RUN: -target-sdk-version=9.2 -fgpu-rdc -fcuda-include-gpubinary %t -o - \
+// RUN: | FileCheck %s -allow-deprecated-dag-overlap \
+// RUN: --check-prefixes=ALL,LNX,RDC,CUDA,CUDARDC,CUDA_NEW,LNX_17
 // RUN: %clang_cc1 -triple x86_64-linux-gnu -emit-llvm %s \
 // RUN: -target-sdk-version=9.2 -o - \
 // RUN: | FileCheck -allow-deprecated-dag-overlap %s -check-prefix=NOGPUBIN

 // RUN: %clang_cc1 -triple x86_64-linux-gnu -emit-llvm %s \
 // RUN: -fcuda-include-gpubinary %t -o - -x hip\
 // RUN: | FileCheck -allow-deprecated-dag-overlap %s --check-prefixes=ALL,LNX,NORDC,HIP,HIPEF
 // RUN: %clang_cc1 -triple x86_64-linux-gnu -emit-llvm %s \
[46 lines not shown]
 // LNX-DAG: @ext_device_var_def = internal global i32 undef,
 // WIN-DAG: @"?ext_device_var_def@@3HA" = internal global i32 undef
 extern __device__ int ext_device_var_def;
 __device__ int ext_device_var_def = 1;
 // LNX-DAG: @ext_device_var_def = internal global i32 undef,
 // WIN-DAG: @"?ext_constant_var_def@@3HA" = internal global i32 undef
 __constant__ int ext_constant_var_def = 2;

+#if __cplusplus > 201402L
+/// FIXME: Reject __device__ constexpr and inline variables in Sema.
+// LNX_17: @inline_var = internal global i32 undef, comdat, align 4{{$}}
+// LNX_17: @_ZN1C17member_inline_varE = internal constant i32 undef, comdat, align 4{{$}}
+__device__ inline int inline_var = 3;
+struct C {
+  __device__ static constexpr int member_inline_var = 4;
+};
+#endif

 void use_pointers() {
-  int *p;
+  const int *p;
   p = &device_var;
   p = &constant_var;
   p = &shared_var;
   p = &host_var;
   p = &ext_device_var;
   p = &ext_constant_var;
   p = &ext_host_var;
+#if __cplusplus > 201402L
+  p = &inline_var;
+  p = &C::member_inline_var;
+#endif
 }

 // Make sure that all parts of GPU code init/cleanup are there:
 // * constant unnamed string with the device-side kernel name to be passed to
 // __hipRegisterFunction/__cudaRegisterFunction.
 // ALL: @0 = private unnamed_addr constant [18 x i8] c"_Z10kernelfunciii\00"
 // * constant unnamed string with the device-side kernel name to be passed to
 // __hipRegisterVar/__cudaRegisterVar.
[68 lines not shown]

 // Test that we've built a function to register kernels and global vars.
 // ALL: define internal void @__[[PREFIX]]_register_globals
 // ALL: call{{.}}[[PREFIX]]RegisterFunction(i8* %0, {{.}}kernelfunc{{[^,]}}, {{[^@]*}}@0
 // ALL-DAG: call void {{.}}[[PREFIX]]RegisterVar(i8* %0, {{.}}device_var{{[^,]}}, {{[^@]}}@1, {{.}}i32 0, {{i32|i64}} 4, i32 0, i32 0
 // ALL-DAG: call void {{.}}[[PREFIX]]RegisterVar(i8* %0, {{.}}constant_var{{[^,]}}, {{[^@]}}@2, {{.}}i32 0, {{i32|i64}} 4, i32 1, i32 0
 // ALL-DAG: call void {{.}}[[PREFIX]]RegisterVar(i8* %0, {{.}}ext_device_var_def{{[^,]}}, {{[^@]}}@3, {{.}}i32 0, {{i32|i64}} 4, i32 0, i32 0
 // ALL-DAG: call void {{.}}[[PREFIX]]RegisterVar(i8* %0, {{.}}ext_constant_var_def{{[^,]}}, {{[^@]}}@4, {{.}}i32 0, {{i32|i64}} 4, i32 1, i32 0
+// LNX_17-NOT: [[PREFIX]]RegisterVar(i8** %0, {{.*}}inline_var
 // ALL: ret void

 // Test that we've built a constructor.
 // LNX: define internal void @__[[PREFIX]]_module_ctor

 // In separate mode it calls __[[PREFIX]]RegisterFatBinary(&__[[PREFIX]]_fatbin_wrapper)
 // HIP only register fat binary once.
 // HIP: load i8, i8* @__hip_gpubin_handle
[47 lines not shown]