This is an archive of the discontinued LLVM Phabricator instance.

[CUDA] Call atexit() for CUDA destructor early on.
Abandoned · Public

Authored by tra on Jul 24 2018, 3:17 PM.


Details

Reviewers

jlebar
timshen

Summary

There is apparently a race between the fatbin destructors registered by us
and some internal calls registered by the CUDA runtime from cudaRegisterFatbin.
Moving fatbin de-registration to atexit() was not sufficient to avoid a crash in
the CUDA runtime on exit when the runtime was linked statically but the CUDA
kernel was launched from a shared library.

Moving the atexit() call to before the call to cudaRegisterFatbin appears to work
with both statically and dynamically linked CUDA TUs.
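
For reference, the ordering the summary relies on can be sketched with plain std::atexit, whose handlers run in reverse order of registration. The function names below are hypothetical stand-ins, not CUDA runtime entry points, and the sketch assumes (as the summary suspects) that the registration call installs an exit handler of its own. It shows only the standard ordering rule, not what the CUDA runtime actually does.

```cpp
#include <cstdio>
#include <cstdlib>

// Hypothetical stand-ins; these are NOT real CUDA runtime functions.
static void runtime_teardown() { std::puts("runtime teardown"); }
static void module_unregister() { std::puts("module unregister"); }

// Assumption: the registration call installs a runtime-internal exit handler
// as a side effect.
static int fake_register_fatbin() { return std::atexit(runtime_teardown); }

// Registers the module cleanup BEFORE the (fake) fatbin registration.
// Returns 0 when both handlers were registered successfully.
int register_cleanup_then_fatbin() {
  int rc = std::atexit(module_unregister); // registered first, so it runs last
  rc |= fake_register_fatbin();            // registered second, so it runs first
  return rc;
}
```

Calling register_cleanup_then_fatbin() once and then returning from main would run runtime_teardown before module_unregister, because exit handlers are invoked last-registered-first.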

Diff Detail

Build Status

Buildable 20675
Build 20675: arc lint + arc unit

Event Timeline

tra created this revision. Jul 24 2018, 3:17 PM

Herald added subscribers: bixia, sanjoy. Jul 24 2018, 3:17 PM

jlebar accepted this revision. Jul 24 2018, 3:36 PM

jlebar added inline comments.

clang/lib/CodeGen/CGCUDANV.cpp
379	the regular destructor phase
380	a double-free

This revision is now accepted and ready to land. Jul 24 2018, 3:36 PM

Can this ever end up in a shared library? If yes, please use the normal logic for creating a global destructor. atexit is not very friendly to dlopen...

In D49763#1174283, @joerg wrote:

Can this ever end up in a shared library? If yes, please use the normal logic for creating a global destructor. atexit is not very friendly to dlopen...

Yes, it can end up in a shared library. What would be the normal logic in this case?

We used to use a regular global destructor, but that has even worse issues. Alas, NVIDIA provides no documentation on how compiler-generated glue is expected to interact with the CUDA runtime, so we have to guess what it wants.
NVCC-generated glue generates a call to atexit(). If we use global destructors, then by the time they are executed NVIDIA's runtime has already been deinitialized, and our attempt to call it causes a crash.
Deregistering the fatbin from atexit() works better, but apparently we still race with the runtime. Calling atexit() before we register the fatbin appears to work for all combinations of {static/dynamic, kernel/runtime}.

It depends a bit on the platform: __cxa_atexit on most modern ELF systems, with a fallback to atexit. If the global dtor is run too late, it smells like a missing library dependency; they are executed in topological order, after all.
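
At the source level, the "normal" global destructor discussed here corresponds to an object with static storage duration. On Itanium-ABI platforms, compilers typically register such a destructor through __cxa_atexit, keyed to the containing shared object, so that it can also run on dlclose. A minimal sketch, with a hypothetical counter standing in for the real register and unregister calls:

```cpp
#include <cassert>

// Hypothetical counter standing in for real register/unregister calls.
static int live_registrations = 0;

// Registers on construction and unregisters on destruction.
struct ModuleRegistration {
  ModuleRegistration() { ++live_registrations; }
  ~ModuleRegistration() {
    assert(live_registrations > 0 && "unregister without matching register");
    --live_registrations;
  }
  ModuleRegistration(const ModuleRegistration &) = delete;
  ModuleRegistration &operator=(const ModuleRegistration &) = delete;
};

// Static storage duration: constructed before main() in this translation
// unit, destroyed at program exit (or when the shared object is unloaded).
static ModuleRegistration module_registration;
```

The trade-off raised in the thread still applies: a destructor registered this way runs in the regular destructor phase, which is exactly where the author reports problems with the CUDA runtime.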

Ugh. Apparently moving this code up just disabled the module destructor. :-( That explains why we no longer crash.
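
One plausible way for a reordering like this to disable the destructor silently is that the destructor builder needs state that only the registration code sets up, and returns nothing when that state is missing. The toy model below illustrates that pattern. The class and method names are hypothetical and are not Clang's, and the dependency itself is an assumption rather than something stated in this review.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical model of a module-constructor emitter; not Clang's actual API.
class ModuleCtorEmitter {
public:
  // Emits the registration call and records the handle the destructor needs.
  void emitRegisterFatbin() {
    assert(!handle_ && "fatbin should be registered only once");
    ops_.push_back("call register_fatbin");
    handle_ = "gpubin_handle";
  }

  // Emits atexit(module_dtor) only if a destructor can be built, i.e. only if
  // a handle already exists. Returns true when the call was emitted.
  bool emitAtExitForDtor() {
    if (!handle_)
      return false; // nothing to unregister yet, so the call is skipped
    ops_.push_back("call atexit(module_dtor)");
    return true;
  }

  const std::vector<std::string> &ops() const { return ops_; }

private:
  std::optional<std::string> handle_; // set by emitRegisterFatbin()
  std::vector<std::string> ops_;      // emitted operations, in order
};
```

In this model, calling emitAtExitForDtor() before emitRegisterFatbin() emits nothing and reports false, which would look like "no crash" at exit simply because no cleanup is ever registered.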

It appears that the issue that originally prompted this change is due to a suspected bug in glibc, triggered by specific details of our internal build.

Revision Contents

Path: clang/lib/CodeGen/CGCUDANV.cpp
Size: 26 lines

Diff 157141

clang/lib/CodeGen/CGCUDANV.cpp

[369 lines not shown; context: llvm::Function *ModuleCtorFunc = llvm::Function::Create(]
       llvm::GlobalValue::InternalLinkage,
       addUnderscoredPrefixToName("_module_ctor"), &TheModule);
   llvm::BasicBlock *CtorEntryBB =
       llvm::BasicBlock::Create(Context, "entry", ModuleCtorFunc);
   CGBuilderTy CtorBuilder(CGM, Context);

   CtorBuilder.SetInsertPoint(CtorEntryBB);

+  // Create destructor and register it with atexit() the way NVCC does it. Doing
+  // it during regular destructor phase worked in CUDA before 9.2 but results in
+  // double-free in 9.2.
+  if (llvm::Function *CleanupFn = makeModuleDtorFunction()) {
+    // extern "C" int atexit(void (*f)(void));
+    llvm::FunctionType *AtExitTy =
+        llvm::FunctionType::get(IntTy, CleanupFn->getType(), false);
+    llvm::Constant *AtExitFunc =
+        CGM.CreateRuntimeFunction(AtExitTy, "atexit", llvm::AttributeList(),
+                                  /*Local=*/true);
+    CtorBuilder.CreateCall(AtExitFunc, CleanupFn);
+  }
+
   const char *FatbinConstantName;
   const char *FatbinSectionName;
   const char *ModuleIDSectionName;
   StringRef ModuleIDPrefix;
   llvm::Constant *FatBinStr;
   unsigned FatMagic;
   if (IsHIP) {
     FatbinConstantName = ".hip_fatbin";
[139 lines not shown; context: if (IsHIP) {]
     assert(RegisterGlobalsFunc && "Expecting at least dummy function!");
     llvm::Value *Args[] = {RegisterGlobalsFunc,
                            CtorBuilder.CreateBitCast(FatbinWrapper, VoidPtrTy),
                            ModuleIDConstant,
                            makeDummyFunction(getCallbackFnTy())};
     CtorBuilder.CreateCall(RegisterLinkedBinaryFunc, Args);
   }

-  // Create destructor and register it with atexit() the way NVCC does it. Doing
-  // it during regular destructor phase worked in CUDA before 9.2 but results in
-  // double-free in 9.2.
-  if (llvm::Function *CleanupFn = makeModuleDtorFunction()) {
-    // extern "C" int atexit(void (*f)(void));
-    llvm::FunctionType *AtExitTy =
-        llvm::FunctionType::get(IntTy, CleanupFn->getType(), false);
-    llvm::Constant *AtExitFunc =
-        CGM.CreateRuntimeFunction(AtExitTy, "atexit", llvm::AttributeList(),
-                                  /*Local=*/true);
-    CtorBuilder.CreateCall(AtExitFunc, CleanupFn);
-  }
-
   CtorBuilder.CreateRetVoid();
   return ModuleCtorFunc;
 }

 /// Creates a global destructor function that unregisters the GPU code blob
 /// registered by constructor.
 ///
 /// For CUDA:
[66 lines not shown]