This is an archive of the discontinued LLVM Phabricator instance.

[CUDA] Call atexit() for CUDA destructor early on.
AbandonedPublic

Authored by tra on Jul 24 2018, 3:17 PM.

Details

Reviewers
jlebar
timshen
Summary

There's apparently a race between fatbin destructors registered by us
and some internal calls registered by CUDA runtime from cudaRegisterFatbin.
Moving fatbin de-registration to atexit() was not sufficient to avoid crash in
CUDA runtime on exit when the runtime was linked statically, but CUDA
kernel was launched from a shared library.

Moving atexit() call to before we call cudaRegisterFatbin appears to work
with both statically and dynamically linked CUDA TUs.

Event Timeline

tra created this revision.Jul 24 2018, 3:17 PM
jlebar accepted this revision.Jul 24 2018, 3:36 PM
jlebar added inline comments.
clang/lib/CodeGen/CGCUDANV.cpp
379

the regular destructor phase

380

a double-free

This revision is now accepted and ready to land.Jul 24 2018, 3:36 PM
joerg added a subscriber: joerg.Jul 24 2018, 3:52 PM

Can this ever end up in a shared library? If yes, please use the normal logic for creating a global destructor. atexit is not very friendly to dlopen...

tra added a comment.Jul 24 2018, 4:11 PM

Can this ever end up in a shared library? If yes, please use the normal logic for creating a global destructor. atexit is not very friendly to dlopen...

Yes, it can end up in a shared library. What would be the normal logic in this case?

We used to use regular global destructor, but has even worse issues. Alas, NVIDIA provides no documentation to how compiler-generated glue is expected to interact with CUDA runtime, so we need to guess what it wants.
NVCC-generated glue generates call to atexit(). If we use global destructors, then by the time they are executed, nvidia's runtime has already been deinitialized and our attempt to call it causes the crash.
Deregistering fatbin from atexit() works better, but apparently we still race with the runtime. calling atexit() before we register the fatbin appears to work for all combinations of {static/dynamic, kernel/runtime}.

joerg added a comment.Jul 24 2018, 4:24 PM

Depends a bit on the platform, __cxa_atexit on most modern ELF systems, fallback to atexit. If the global dtor is run too late, it smells like a missing library dependency. They are executed in topological order after all.

tra planned changes to this revision.Jul 24 2018, 4:54 PM

Ugh. Apparently moving this code up just disabled module destructor. :-( That explains why we no longer crash.

tra abandoned this revision.Jul 30 2018, 3:59 PM

It appears that the issue that originally prompted this change is due to suspected bug in glibc triggered by specific details of our internal build.