Currently on some platforms the program might crash when the program
exits. It is because the memory manager calls plugin interface to release its
buffered memory and the plugin has already been destroyed. After some search
online, it seems that loaded dynamic libraries will be destroyed in the reverse
order of loading, which means by no means, those plugin libraries loaded by
dlopen will be destroyed after libomtarget. We need something to tell the
RTLInfoTy that the hold plugin is not valid anymore so don't call those
interfaces. This is done by registering a callback function in the plugin so that
it will be called when the plugin starts to destroy itself. The callback function
for now only does one thing: release the memory manager so that all buffered
target memory will be released. We can add more features in the future if
needed.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
openmp/libomptarget/src/MemoryManager.cpp | ||
---|---|---|
113 | Can't we do this in the destructor MemoryManagerTy::~MemoryManagerTy() ? |
openmp/libomptarget/src/MemoryManager.cpp | ||
---|---|---|
113 | Nice point! Did that. |
One nit, works for me otherwise. Anyone else?
openmp/libomptarget/src/MemoryManager.cpp | ||
---|---|---|
101 | Early exit or braces please, former preferred. |
I suppose this is better, though it seems backwards to me. Plus we still leak memory with this design.
I was hoping the plugin would expose a function that libomptarget calls to do the cleanup, in place of today's global destructor. Libomptarget can then do an ordered cleanup, after the memory manager destructor has run.
openmp/libomptarget/src/rtl.cpp | ||
---|---|---|
444–445 | Can we just implement this instead? |
openmp/libomptarget/src/rtl.cpp | ||
---|---|---|
444–445 | At this point, the plugin has already been destroyed. |
openmp/libomptarget/src/rtl.cpp | ||
---|---|---|
444–445 | How is that possible, since we never call dlclose() for the plugins? Should we just "deinitialize" the memory manager here instead of postponing the "deinitialization" to MemoryManagerTy desctructor, which will obviously be executed at the wrong time. |
openmp/libomptarget/src/rtl.cpp | ||
---|---|---|
444–445 | That's most tough part of this question. libomptarget has no idea when its opened plugin will be destroyed. Actually, you can see there is no dlclose in current code. What we can know is, plugin will be destroyed before libomptarget. You can easily verify it by adding DP(...) to the destructor of the plugin. |
openmp/libomptarget/src/rtl.cpp | ||
---|---|---|
444–445 | This sounds really strange at least for Linux, since:
Do you have a reproducer for this kind of an issue? I must be missing something here... |
openmp/libomptarget/src/rtl.cpp | ||
---|---|---|
444–445 | You can use openmp/libomptarget/test/offloading/memory_manager.cpp and turn debug print on via setting env LIBOMPTARGET_DEBUG to a large number like 10. Then you can see a lot of logs. ... Libomptarget --> omp_target_free deallocated device ptr PASS (1) Target CUDA RTL --> CUDA plugin destroyed (2) Libomptarget --> __tgt_unregister_lib Libomptarget --> Unloading target library! Libomptarget --> Image 0x0000000000403870 is compatible with RTL 0x0000000000e4a8b0! Libomptarget --> Unregistered image 0x0000000000403870 from RTL 0x0000000000e4a8b0! Libomptarget --> Done unregistering images! Libomptarget --> Removing translation table for descriptor 0x00000000004abb60 Libomptarget --> Done unregistering library! Libomptarget --> Deinit target library! Target CUDA RTL --> Error returned from cuCtxSetCurrent Target CUDA RTL --> Unresolved CUDA error code: 4 Target CUDA RTL --> Unsuccessful cuGetErrorString return status: 4 Target CUDA RTL --> Error returned from cuCtxSetCurrent ... Line (1) and (2) were printed because I added DP to the destructor of CUDA plugin and function __tgt_unregister_lib. The last tens of lines are all error messages because CUDA plugin was destroyed before libomptarget. |
openmp/libomptarget/src/rtl.cpp | ||
---|---|---|
444–445 | Ah, I see. The global destructor for DeviceRTL is called before __tgt_unregister_lib is called. The plugin (the shared object) is not actually deallocated yet. Thank you for the explanation. I am not sure libomptarget has to be involved here, since this may be done exclusively at the plugin level, i.e. DeviceRTL may be dynamically allocated/deallocated using global init/fini functions, and then all plugin APIs may return early, if they are invoked after DeviceRTL has been deallocated. Just to be clear, I am not against this change-set :) Anyway, libomptarget has to be redesigned a bit because the plugins loading initiated in a global constructor (which calls RTLsTy::RegisterLib) is violating Microsoft conventions for DLL initialization (e.g. you cannot create a DLL with offload and then LoadLibrary this DLL, because this will result in LoadLibrary->...->LoadLibrary call stack, which is not allowed). We should probably load the plugins on the first OpenMP construct/API encountered during the execution. I assume we will have to define clear entry/exit points for plugins then, and this will solve the issue that you are addressing here. |
openmp/libomptarget/src/rtl.cpp | ||
---|---|---|
444–445 | libomptarget is involved because we have a memory manager which will releases all its buffered memory in its destructor. It is an object in libomptarget. So its destructor is invoked when libomptarget is being destroyed. Now we know at this moment the plugin has already been destroyed, and the memory manager still calls plugin interfaces, causing program crash on some platforms. However, usually it will not crash on CUDA platforms, just having some errors because the plugin is using some CUDA related data which have been explicitly destroyed when the plugin is destroyed. Redesign of libomptarget is beyond the scope of this patch. It might be done when LLVM OpenMP decides to support Windows. For now this patch is like a WA. |
The memory will be eventually released by the driver if the process is terminated, right?
I was hoping the plugin would expose a function that libomptarget calls to do the cleanup, in place of today's global destructor. Libomptarget can then do an ordered cleanup, after the memory manager destructor has run.
I have some new ideas to make the cleanup possible. Will update the patch accordingly.
Maybe. Probably platform-specific. I definitely have to power cycle GPUs to get them working again occasionally so the kernel doesn't (always?) leave the stack in a just-booted state after process exit.
Leaking memory by design thwarts using memory tracking tools to find accidental leaks. In this case, it's indicative that we've failed to establish static lifetimes for the components. Too much global state.
The complexity of this approach is very high relative to instantiating MemoryManagerTy in the cuda plugin, and optionally in any others that want to use it.
openmp/libomptarget/plugins/cuda/src/rtl.cpp | ||
---|---|---|
282 | It looks like this stuff needs to be implemented in every plugin, since they all use the memory manager at present |
This approach does not only fix the issue, but also open a door for future development. RTLInfoTy is used by DeviceTy, so each one has pointers to the other. It is a weak dependence so not a problem. We have a lot of similar things in LLVM, such as use and user.
The callback function can also be used later as it is the only chance for libomptarget to know when the plugin is destroyed and do something needed.
In the future we will add something to let a plugin exchange information with libomptarget. For example, a plugin can tell libomptarget what features it can support and libomptarget can use that information accordingly. Moving the memory manager to plugin and taking it as a common part is also feasible.
clang-tidy: warning: invalid case style for function '__tgt_rtl_register_destruction_cb' [readability-identifier-naming]
not useful
clang-tidy: warning: invalid case style for parameter 'cb' [readability-identifier-naming]
not useful
clang-tidy: warning: invalid case style for parameter 'data' [readability-identifier-naming]
not useful