This is an archive of the discontinued LLVM Phabricator instance.

Differential D123441

[CUDA][HIP] Fix host used external kernel in archive
ClosedPublic

Authored by yaxunl on Apr 8 2022, 9:15 PM.

Details

Reviewers

tra

Commits

rG0424b5115cff: [CUDA][HIP] Fix host used external kernel in archive

Summary

With -fgpu-rdc, a host function may call an external kernel
that is defined in an archive of bitcode. Because this external
kernel is referenced only from the host function, the device
bitcode contains no reference to it, so the linker does not
try to resolve the kernel from the archive.

To fix this issue, external kernels and device variables used
by host code are tracked. A global array containing pointers
to these external kernels and variables is emitted, which
serves as an artificial reference to the external kernels
and variables used by the host.
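The linking behaviour described in the summary can be illustrated with a toy model. This is only a sketch under stated assumptions: `Member`, `extractMembers`, and the member names used below are hypothetical and are not part of clang or any linker. It assumes the usual on-demand archive semantics, where a member is extracted only if it defines a symbol that is currently undefined.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical model of one archive member: the symbols it defines and the
// symbols it references without defining.
struct Member {
  std::string name;
  std::set<std::string> defines;
  std::set<std::string> undefined;
};

// Returns the names of the members that an on-demand archive search would
// extract, starting from the undefined symbols of the objects being linked.
std::set<std::string> extractMembers(const std::vector<Member> &archive,
                                     std::set<std::string> undefined) {
  std::set<std::string> extracted;
  std::set<std::string> defined;
  bool changed = true;
  while (changed) { // repeat until no further member is pulled in
    changed = false;
    for (const Member &m : archive) {
      if (extracted.count(m.name))
        continue;
      bool needed = false;
      for (const std::string &sym : m.defines) {
        if (undefined.count(sym) && !defined.count(sym)) {
          needed = true;
          break;
        }
      }
      if (!needed)
        continue;
      // Extracting a member resolves its definitions and may introduce
      // new undefined symbols of its own.
      extracted.insert(m.name);
      defined.insert(m.defines.begin(), m.defines.end());
      undefined.insert(m.undefined.begin(), m.undefined.end());
      changed = true;
    }
  }
  return extracted;
}
```

In this model, a device object with no undefined symbols extracts nothing from the archive, which corresponds to the failure described above. Seeding the search with `_Z7kernel1v`, as the artificial reference would, extracts the member that defines it together with anything that member needs.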

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

yaxunl created this revision. Apr 8 2022, 9:15 PM

yaxunl requested review of this revision. Apr 8 2022, 9:15 PM

Harbormaster completed remote builds in B158819: Diff 421680. Apr 8 2022, 10:07 PM

tra added a comment:

LGTM in principle. This will keep around the GPU code we do need.

That said, it seems to be a rather blunt hammer. I think we'll end up linking almost everything in an archive into the final executable as we'll likely have a host-visible symbol in most of the GPU objects (e.g. most of them would have a kernel).
Device-side linking would also be unaware of which objects were actually linked into the host executable and thus would link in more objects than necessary. We could have achieved about the same result by linking with --whole-archive.

The root of the problem here is that in isolation GPU-side linking does not know what will really be needed by the host and thus has to link in everything, except, maybe, object files where we may have __device__ functions only.
Ideally, the linking should be a two-phase process -- link CPU side, extract references to the GPU symbols (host-side compilation would have to be augmented to place them in a well known location) and pass them to the GPU-side linker which would then have all the info necessary to pull in relevant GPU-side objects without compiler having to force having nearly all of them linked in.

I realize that this would be a nontrivial change to the compilation pipeline. As a short-to-medium term solution, this patch may do, though I'd probably prefer just linking with --whole-archive as it would, in theory, be simpler.

yaxunl, replying to @tra's comment above (D123441#3446408):

This approach will only link in kernels and device variables used by host code, whereas --whole-archive will keep everything in the archive. There are use cases where the archive contains a large number of kernels and the application uses only a few of them.

Also, --whole-archive will require users to carefully arrange --whole-archive and --no-whole-archive options for the archives they use. This approach avoids that.

tra mentioned this in D123471: [CUDA] Create offloading entries when using the new driver. Apr 12 2022, 12:33 PM

tra, quoting @yaxunl:

"This approach will only link in kernels and device variables used by host code"

In the absence of explicit reference info from the host side, the GPU-side linker must link all objects with symbols that may be used by the host.
E.g., if we have a library with three objects, each with one kernel (and thus potentially used by the host), but the main TU only refers to a kernel from one of them, the GPU-side linker would still have to link in all three objects from the library, as any of them may have been referenced by the host.

"--whole-archive will require users to carefully arrange --whole-archive and --no-whole-archive options for the archives they use."

This would be done by the driver. My understanding is that we already have to do nontrivial stuff under the hood (e.g. unbundling) so telling the linker that static archives must always use --whole-archive should be doable.
I don't insist on it, just exploring alternative options we may have.

yaxunl, replying to @tra's first point above (D123441#3446478):

You are talking about a use case that actually needs --whole-archive option. If the main TU does not reference some symbols in the archive but wants all symbols in the archive to be linked in, it is justifiable to use --whole-archive and HIP toolchain can support passing -Wl,--whole-archive specified in the command line to the device linking step.

However, in normal use cases, users only want to link in symbols referenced by the main TU. They do not need to link every symbol in the archive.

Also, I don't see the advantage of resolving this issue through toolchains. You still need to detect kernels and device variables referenced by host code and generate IR that introduces artificial references to them. It just becomes more complicated, since you have to do this with external tools and handle extra outputs and inputs in the toolchain, whereas in the current approach the information is directly available in the AST and the IR can be generated by clang codegen.

tra, replying to @yaxunl's first paragraph above (D123441#3446499):

Hmm. My point was the opposite -- only one object should be linked and I saw no way to do that without conservatively including everything.
I think I've misunderstood what your patch does.

So, a main TU with just __global__ void kernel(); would emit a reference when it's compiled on the GPU side. That, in turn will tell the linker what it needs to pull from the libraries and things should just work.
If that's the case, then it would work in my example, too.

"However, in normal use cases, users only want to link in symbols referenced by the main TU. They do not need to link every symbol in the archive."

Agreed.

clang/lib/CodeGen/CodeGenModule.cpp
Line 602 (tra): This is not HIP-specific and should have a more generic name. `@gpu.used.external`?

This revision is now accepted and ready to land. Apr 12 2022, 2:13 PM

yaxunl, replying to @tra (D123441#3446719):

Yes. This patch creates artificial references in the device IR that originate from the host functions in the same TU. In other words, it creates exactly the references that should be there but are missing, and no others.
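As a rough host-side analogy (an assumption for illustration, not what clang emits), an array of pointers that the compiler is forced to keep has the same effect in ordinary C++: every entry is a reference the linker has to resolve. The patch itself emits an LLVM global with appending linkage and registers it via `addCompilerUsedGlobal`. In the sketch below, `kernel1` and `var1` are plain C++ stand-ins, `Opaque`, `gpu_used_external` as a C++ array, and `numUsedExternal` are hypothetical names, and `[[gnu::used]]` assumes GCC or Clang.

```cpp
#include <cstddef>

// Stand-ins for an external kernel and an external device variable. In the
// scenario from the review they would be defined in another object inside an
// archive; they are defined at the bottom of this file only so that the
// sketch links on its own.
void kernel1();
extern int var1;

// The array is never read, but [[gnu::used]] keeps it in the object file, so
// the references to kernel1 and var1 survive and must be resolved at link time.
using Opaque = const void *;
[[gnu::used]] static const Opaque gpu_used_external[] = {
    reinterpret_cast<Opaque>(&kernel1),
    &var1,
};

// Number of artificial references recorded in the array.
constexpr std::size_t numUsedExternal() {
  return sizeof(gpu_used_external) / sizeof(gpu_used_external[0]);
}

void kernel1() {}
int var1 = 0;
```

If the two definitions at the bottom were moved into an archive member, the undefined references left by the array would be what makes an on-demand archive search extract that member.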

clang/lib/CodeGen/CodeGenModule.cpp
Line 602 (yaxunl): Will do when committing.

This revision was landed with ongoing or failed builds. Apr 13 2022, 7:48 AM

Closed by commit rG0424b5115cff: [CUDA][HIP] Fix host used external kernel in archive (authored by yaxunl).

This revision was automatically updated to reflect the committed changes.

yaxunl marked an inline comment as done.

yaxunl added a commit: rG0424b5115cff: [CUDA][HIP] Fix host used external kernel in archive.

GitHub <noreply@github.com> mentioned this in rGf2a1331a01ff: [CUDA][HIP] Do not mark extern shared var (#65990). Sep 11 2023, 2:05 PM

Revision Contents

Path                                          Size
clang/include/clang/AST/ASTContext.h          4 lines
clang/lib/CodeGen/CodeGenModule.cpp           24 lines
clang/lib/Sema/SemaCUDA.cpp                   7 lines
clang/lib/Sema/SemaExpr.cpp                   8 lines
clang/test/CodeGenCUDA/host-used-extern.cu    51 lines

Diff 422519

clang/include/clang/AST/ASTContext.h

@@ #include "clang/Basic/RISCVVTypes.def" @@
   mutable Decl *VaListTagDecl = nullptr;

   // Implicitly-declared type 'struct _GUID'.
   mutable TagDecl *MSGuidTagDecl = nullptr;

   /// Keep track of CUDA/HIP device-side variables ODR-used by host code.
   llvm::DenseSet<const VarDecl *> CUDADeviceVarODRUsedByHost;

+  /// Keep track of CUDA/HIP external kernels or device variables ODR-used by
+  /// host code.
+  llvm::DenseSet<const ValueDecl *> CUDAExternalDeviceDeclODRUsedByHost;
+
   ASTContext(LangOptions &LOpts, SourceManager &SM, IdentifierTable &idents,
              SelectorTable &sels, Builtin::Context &builtins,
              TranslationUnitKind TUKind);
   ASTContext(const ASTContext &) = delete;
   ASTContext &operator=(const ASTContext &) = delete;
   ~ASTContext();

   /// Attach an external AST source to the AST context.

clang/lib/CodeGen/CodeGenModule.cpp

@@ if (getTriple().isAMDGPU()) { @@
     // library is ready.
     if (getTarget().getTargetOpts().CodeObjectVersion == TargetOptions::COV_5) {
       getModule().addModuleFlag(llvm::Module::Error,
                                 "amdgpu_code_object_version",
                                 getTarget().getTargetOpts().CodeObjectVersion);
     }
   }

+  // Emit a global array containing all external kernels or device variables
+  // used by host functions and mark it as used for CUDA/HIP. This is necessary
+  // to get kernels or device variables in archives linked in even if these
+  // kernels or device variables are only used in host functions.
+  if (!Context.CUDAExternalDeviceDeclODRUsedByHost.empty()) {
+    SmallVector<llvm::Constant *, 8> UsedArray;
+    for (auto D : Context.CUDAExternalDeviceDeclODRUsedByHost) {
+      GlobalDecl GD;
+      if (auto *FD = dyn_cast<FunctionDecl>(D))
+        GD = GlobalDecl(FD, KernelReferenceKind::Kernel);
+      else
+        GD = GlobalDecl(D);
+      UsedArray.push_back(llvm::ConstantExpr::getPointerBitCastOrAddrSpaceCast(
+          GetAddrOfGlobal(GD), Int8PtrTy));
+    }
+
+    llvm::ArrayType *ATy = llvm::ArrayType::get(Int8PtrTy, UsedArray.size());
+
+    auto *GV = new llvm::GlobalVariable(
+        getModule(), ATy, false, llvm::GlobalValue::AppendingLinkage,
+        llvm::ConstantArray::get(ATy, UsedArray), "gpu.used.external");
+    addCompilerUsedGlobal(GV);
+  }
+
   emitLLVMUsed();
   if (SanStats)
     SanStats->finish();

   if (CodeGenOpts.Autolink &&
       (Context.getLangOpts().Modules || !LinkerOptionsMetadata.empty())) {
     EmitModuleLinkOptions();
   }

clang/lib/Sema/SemaCUDA.cpp

@@ case CFP_WrongSide: @@
       return CallerKnownEmitted
                  ? SemaDiagnosticBuilder::K_ImmediateWithCallStack
                  : SemaDiagnosticBuilder::K_Deferred;
     default:
       return SemaDiagnosticBuilder::K_Nop;
     }
   }();

-  if (DiagKind == SemaDiagnosticBuilder::K_Nop)
+  if (DiagKind == SemaDiagnosticBuilder::K_Nop) {
+    // For -fgpu-rdc, keep track of external kernels used by host functions.
+    if (LangOpts.CUDAIsDevice && LangOpts.GPURelocatableDeviceCode &&
+        Callee->hasAttr<CUDAGlobalAttr>() && !Callee->isDefined())
+      getASTContext().CUDAExternalDeviceDeclODRUsedByHost.insert(Callee);
     return true;
+  }

   // Avoid emitting this error twice for the same location. Using a hashtable
   // like this is unfortunate, but because we must continue parsing as normal
   // after encountering a deferred error, it's otherwise very tricky for us to
   // ensure that we only emit this deferred error once.
   if (!LocsWithCUDACallDiags.insert({Caller, Loc}).second)
     return true;

clang/lib/Sema/SemaExpr.cpp

@@ if (VarTarget == Sema::CVT_Host && @@
             << /*host*/ 2 << /*variable*/ 1 << Var << UserTarget;
         SemaRef.targetDiag(Var->getLocation(),
                            Var->getType().isConstQualified()
                                ? diag::note_cuda_const_var_unpromoted
                                : diag::note_cuda_host_var);
       }
     } else if (VarTarget == Sema::CVT_Device &&
                (UserTarget == Sema::CFT_Host ||
-                UserTarget == Sema::CFT_HostDevice) &&
-               !Var->hasExternalStorage()) {
+                UserTarget == Sema::CFT_HostDevice)) {
       // Record a CUDA/HIP device side variable if it is ODR-used
       // by host code. This is done conservatively, when the variable is
       // referenced in any of the following contexts:
       //   - a non-function context
       //   - a host function
       //   - a host device function
       // This makes the ODR-use of the device side variable by host code to
       // be visible in the device compilation for the compiler to be able to
       // emit template variables instantiated by host code only and to
       // externalize the static device side variable ODR-used by host code.
-      SemaRef.getASTContext().CUDADeviceVarODRUsedByHost.insert(Var);
+      if (!Var->hasExternalStorage())
+        SemaRef.getASTContext().CUDADeviceVarODRUsedByHost.insert(Var);
+      else if (SemaRef.LangOpts.GPURelocatableDeviceCode)
+        SemaRef.getASTContext().CUDAExternalDeviceDeclODRUsedByHost.insert(Var);
     }
   }

   Var->markUsed(SemaRef.Context);
 }

 void Sema::MarkCaptureUsedInEnclosingContext(VarDecl *Capture,
                                              SourceLocation Loc,

clang/test/CodeGenCUDA/host-used-extern.cu

This file was added.

// RUN: %clang_cc1 -triple amdgcn-amd-amdhsa -fcuda-is-device -x hip %s \
// RUN:   -fgpu-rdc -std=c++11 -emit-llvm -o - -target-cpu gfx906 | FileCheck %s

// RUN: %clang_cc1 -triple amdgcn-amd-amdhsa -fcuda-is-device -x hip %s \
// RUN:   -fgpu-rdc -std=c++11 -emit-llvm -o - -target-cpu gfx906 \
// RUN:   | FileCheck -check-prefix=NEG %s

// RUN: %clang_cc1 -triple amdgcn-amd-amdhsa -fcuda-is-device -x hip %s \
// RUN:   -std=c++11 -emit-llvm -o - -target-cpu gfx906 \
// RUN:   | FileCheck -check-prefixes=NEG,NORDC %s

#include "Inputs/cuda.h"

// CHECK-LABEL: @gpu.used.external = appending {{.*}}global
// CHECK-DAG: @_Z7kernel1v
// CHECK-DAG: @_Z7kernel4v
// CHECK-DAG: @var1
// CHECK-LABEL: @llvm.compiler.used = {{.*}} @gpu.used.external

// NEG-NOT: @gpu.used.external = {{.*}} @_Z7kernel2v
// NEG-NOT: @gpu.used.external = {{.*}} @_Z7kernel3v
// NEG-NOT: @gpu.used.external = {{.*}} @var2
// NEG-NOT: @gpu.used.external = {{.*}} @var3
// NORDC-NOT: @gpu.used.external = {{.*}} @_Z7kernel1v
// NORDC-NOT: @gpu.used.external = {{.*}} @_Z7kernel4v
// NORDC-NOT: @gpu.used.external = {{.*}} @var1

__global__ void kernel1();

// kernel2 is not marked as used since it is a definition.
__global__ void kernel2() {}

// kernel3 is not marked as used since it is not called by host function.
__global__ void kernel3();

// kernel4 is marked as used even though it is not called.
__global__ void kernel4();

extern __device__ int var1;

__device__ int var2;

extern __device__ int var3;

void use(int *p);

void test() {
  kernel1<<<1, 1>>>();
  void *p = (void *)kernel4;
  use(&var1);
}