If llvm so lib is dlopened and dlclosed several times, then memory leak can be observed, reported by Valgrind.
This patch fixes the issue.
Differential D83372
Fix for memory leak reported by Valgrind mwezdeck on Jul 8 2020, 12:54 AM. Authored by
Details If llvm so lib is dlopened and dlclosed several times, then memory leak can be observed, reported by Valgrind. This patch fixes the issue.
Diff Detail
Event TimelineComment Actions +Chris Bieneman I'm a bit worried that there is a reason why the removed code was there, rather than using a static initializer. Comment Actions Yes it does. std::recursive_mutex has non-trivial constructors and destructors. The code that was there knowingly leaked the mutex to avoid the static constructors and destructors. It should be safe to destroy it in llvm_shutdown, but making it a global is bad. Comment Actions Can anybody point me out why static constructors are bad? Is there some kind of idiom? Any resource where I can read about that will be helpful. Thanks. I'm updating the patch Comment Actions There are a lot of downsides to global constructors. One is that they are difficult to debug because the code runs before main, so if anything goes wrong callstacks are odd, debuggers sometimes struggle to attach, and general headaches ensue. Additionally global constructors are an always-paid performance hit. Local statics are only constructed if they are used, but global statics are always constructed, and transitively they construct anything else they use. This can severely impact the process launch time of a program and it is very difficult to optimize around (other than just not using global constructors). While process launch time may not be something most applications care about because it "only happens once" for a compiler that can be invoked tens or hundreds of thousands of times per day per user, that cost really adds up. The last big issue with global constructors in LLVM is that LLVM is used as a library in many contexts, and global constructors (and generally global variables) cause lots of issues with that. One of the big problems with using global constructors in library code is that it is really easy to get multiple copies of the globals embedded into the same process. This generally happens when you have a single executable that links two libraries that both include parts of LLVM (in particular libLLVMSupport which you usually want to be able to inline/LTO across). There have been periodic pushes to clean up the global variables, and most of the ones in LLVM today are cl::opt objects which need a re-design to clean up. Comment Actions Logically your patch here looks fine. I'd rather see it use a std::lock_guard as the original code did, with a nested scope added, but that is pretty nitpicky. Also the code doesn't conform to LLVM's coding standards for variable naming (https://llvm.org/docs/CodingStandards.html#name-types-functions-variables-and-enumerators-properly). Comment Actions Thank you very much for explanation. Indeed, the context where user loads an so file thousands times a day makes requirement for startup time very important. Thanks again for poining this out to me. I've updated the patch. Comment Actions Some thoughts:
What happens when someone creates a new managed static after this? This should work. Can you reset the mutex_init_flag somehow? Comment Actions I think there may be issues beyond just this change that would prevent LLVM from functioning after llvm_shutdown is called. I wouldn't expect this change to make that suddenly work. In terms of making the flag resettable, it would have to stop using llvm::once_flag. llvm::once_flag is an llvm implementation of std::once_flag for platforms that don't have the std version, and the std version is not resettable. Comment Actions Ok, then please set the pointer to null - that will cause things to explode if someone tries to use things after llvm_shutdown. Thanks! Comment Actions Second initialization is required to handle corner cases. It will not happen in real world application. There is call for ManagedStatic's function after llvm_shutdown in DynamicLibrary unit-test, this second initialization is for covering that. Comment Actions Hello, Can anybody from reviewers look into this change? It would be really great if this patch can be merged this month. Comment Actions I think I can't push this change to repo by myself. Can I kindly ask somebody with access rights to push commit, to push this change to repo? I will be really grateful. Thank you one more time. Comment Actions Having different behavior for tests V production worries me a bit. How would people feel about this instead: static std::recursive_mutex *getManagedStaticMutex() { static std::recursive_mutex m; return &m; } (& leave llvm_shutdown as-is) No global ctor - lazily constructed. No global dtor... hmm, libstdc++ has a trivial dtor at least when I look at it one way (https://godbolt.org/z/ae6j7f) , but libc++'s is non-trivial. Is the additional complexity worth saving this one global dtor? (assuming I'm even right that the approach I'm suggesting would address the issue under discussion in this review) I certainly have some empathy for the consistency of avoiding all global ctors/dtors, but this one's a bit of a case of "who watches the watchmen" - can't really add another mutex to guard the initialization of this mutex. Could use atomic ops but that's tricky to get right/maintain. Comment Actions Hello, this change would be very useful for SwiftShader, a CPU-targeted Vulkan driver used by Chrome and Android (among other projects). This would plug memory leaks reported when unloading our driver. Can we get this reviewed/merged please? Comment Actions What sort of situation are you dealing with/how does this issue manifest for you? My understanding is that this would be a problem for users unloading/reloading a dynamic library (dll, so, dylib) but generally not an issue for static linking, for instance? & given, as @beanz said, llvm is unlikely to support re-use of llvm after shutdown, this doesn't seem like it'd go terribly well if you expect to be able to unload and then reload LLVM again later. Comment Actions
So basically we produce a shared library (DLL, .so, .dylib) that implements the Vulkan API. The issue manifests itself from our internal test bots which, as part of their tests, load the vulkan driver, perform Vulkan calls, validate the result, and then unload the driver. Some of these bots enable Leak Sanitizer (LSAN), and upon ending the test process, reports memory leaks - including those from and loaded/unloaded shared libraries. This also shows up on Windows when using Application Verifier.
I'm not sure what the problem is with unloading and reloading LLVM. For our use-case, anyway, it seems to work fine. We use LLVM's JIT API, so maybe that has something to do with it, though I don't really understand what the problem could be in a general sense. Unloading a shared library should clean up all allocations and handled, then reloading it again should allocate and initialize everything again. Now, having written all that, there's been some developments since I posted yesterday. Using the above latest patch, along with invoking llvm::llvm_shutdown(), worked fine on our Windows build, and no more leaks were detected by AppVerifier. Note that in order to invoke llvm::llvm_shutdown, I used the LLVM facility "llvm_shutdown_obj" class, declared as a global in one of our translation units (this class's destructor invokes llvm_shutdown()). Now, except for on Windows via DllMain, there is no hook for "unloading DLL before global objects are destroyed" on Linux or Mac (maybe also Android, and other platforms), so we create this global, which seemed to work fine. Except that it didn't work on Linux, and the reason is global object destruction order: llvm::llvm_shutdown locks the ManagedStaticMutex, so it must be called before the ManagedStaticMutex static instance is destroyed. On Windows, it happened to destroy my global llvm_shutdown_obj instance before it destroyed the ManagedStaticMutex instance (named 'm' in this patch). But on Linux, it was reversed. Classic C++ problem, where the only standard guarantee we have is that objects in the same translation unit are constructed in declaration order, and destructed in the reverse order; but for objects in different TUs, there is no guarantee. I'm thinking this is probably the reason ManagedStaticMutex was allowed to leak in the first place: there was no way to guarantee that clients would invoke llvm::llvm_shutdown before this global got destroyed, especially when making use of the llvm_shutdown_obj RAII class. Looking at Patch 5, the one just before this latest version, I think might work, since it relies on llvm_shutdown deleting the ManagedStaticMutex after it uses it. Of course, this also implies that this call should not be made from multiple threads anyway, so perhaps another solution is to keep the latest patch as-is, but remove the lock_guard in llvm_shutdown, and document that it must not be called from multiple threads, as @lattner proposed. Comment Actions Coming back to this, in SwiftShader, I've landed this patch plus the removal of the lock on getManagedStaticMutex from within llvm::llvm_shutdown. I think this is the best approach: document that llvm_shutdown is to be called from a single thread, along with this patch (the last version, with static std::recursive_mutex m; in getManagedStaticMutex. This allows for code that uses llvm::llvm_shutdown_obj to work properly, without worrying about order of destruction, and it plugs the memory leak. Comment Actions Sounds OK to me - if this patch can be updated to add a comment to llvm_shutdown to say it's only safe to call without any other threads executing LLVM APIs. (re: questions about whether llvm_shutdown is usable - yeah, probably if you're unloading the whole dll after llvm_shutdown, that seems like a scenario we could/should support. But running llvm_shutdown, and then using LLVM APIs again in the same process/libraries - that might not go so well & might be less likely to be something that'd be feasible support with a few small patches) Perhaps llvm_shutdown should be documented to only be used if you're about to unload the library. Comment Actions
... and remove the locking in said function. Sorry if that was implied in what you wrote, but it's important for this patch to work.
Yes, I agree. Or at least a comment stating that after llvm_shutdown, no further LLVM calls should be made. Comment Actions @mwezdeck Do you think you can update this patch with the recommendations above, namely:
It would be great to land this patch with these changes. Comment Actions SGTM
Comment Actions I uploaded new version with one comment line added. @dblaikie , do you have a permission to merge these changes? |
Wouldn't mind some extra clarification that this should be the last use of LLVM APIs in this loaded module (executable/dynamic library) - shutting down and then restarting use of LLVM in the same process is probably not supported/not practical to support so far as I know. (could be wrong though/open to discussion)