There's apparently a race between fatbin destructors registered by us
and some internal calls registered by CUDA runtime from cudaRegisterFatbin.
Moving fatbin de-registration to atexit() was not sufficient to avoid crash in
CUDA runtime on exit when the runtime was linked statically, but CUDA
kernel was launched from a shared library.
Moving atexit() call to before we call cudaRegisterFatbin appears to work
with both statically and dynamically linked CUDA TUs.
the regular destructor phase