⚙ D83372 Fix for memory leak reported by Valgrind

mwezdeck created this revision.Jul 8 2020, 12:54 AM

Herald added a project: Restricted Project. · View Herald TranscriptJul 8 2020, 12:54 AM

Herald added subscribers: llvm-commits, hiraditya. · View Herald Transcript

Harbormaster failed remote builds in B63370: Diff 276326!Jul 8 2020, 1:11 AM

mwezdeck updated this revision to Diff 276340.Jul 8 2020, 1:48 AM

mwezdeck set the repository for this revision to rG LLVM Github Monorepo.Jul 8 2020, 2:15 AM

mwezdeck added a reviewer: yavn.Jul 8 2020, 4:55 AM

Can I kindly ask for review?

mwezdeck added a reviewer: tpr.Jul 8 2020, 6:03 AM

+Chris Bieneman

I'm a bit worried that there is a reason why the removed code was there, rather than using a static initializer.

Does this add a static constructor?

In D83372#2139641, @lattner wrote:

Does this add a static constructor?

Yes it does. std::recursive_mutex has non-trivial constructors and destructors.

The code that was there knowingly leaked the mutex to avoid the static constructors and destructors. It should be safe to destroy it in llvm_shutdown, but making it a global is bad.

Can anybody point me out why static constructors are bad? Is there some kind of idiom?

Any resource where I can read about that will be helpful.

Thanks.

I'm updating the patch

There are a lot of downsides to global constructors. One is that they are difficult to debug because the code runs before main, so if anything goes wrong callstacks are odd, debuggers sometimes struggle to attach, and general headaches ensue.

Additionally global constructors are an always-paid performance hit. Local statics are only constructed if they are used, but global statics are always constructed, and transitively they construct anything else they use. This can severely impact the process launch time of a program and it is very difficult to optimize around (other than just not using global constructors). While process launch time may not be something most applications care about because it "only happens once" for a compiler that can be invoked tens or hundreds of thousands of times per day per user, that cost really adds up.

The last big issue with global constructors in LLVM is that LLVM is used as a library in many contexts, and global constructors (and generally global variables) cause lots of issues with that. One of the big problems with using global constructors in library code is that it is really easy to get multiple copies of the globals embedded into the same process. This generally happens when you have a single executable that links two libraries that both include parts of LLVM (in particular libLLVMSupport which you usually want to be able to inline/LTO across). There have been periodic pushes to clean up the global variables, and most of the ones in LLVM today are cl::opt objects which need a re-design to clean up.

Logically your patch here looks fine. I'd rather see it use a std::lock_guard as the original code did, with a nested scope added, but that is pretty nitpicky.

Also the code doesn't conform to LLVM's coding standards for variable naming (https://llvm.org/docs/CodingStandards.html#name-types-functions-variables-and-enumerators-properly).

Thank you very much for explanation. Indeed, the context where user loads an so file thousands times a day makes requirement for startup time very important.

Thanks again for poining this out to me.

I've updated the patch.

Harbormaster completed remote builds in B63704: Diff 276941.Jul 10 2020, 1:42 AM

Can anybody review this change one more time? I will be really grateful.

Hi gentlemen,

Could you please review this change?

Thank you,
Maksym

Some thoughts:

Nit: you only need to call getManagedStaticMutex() once.
I would also recommend resetting the ManagedStaticMutex pointer to null after deleting the mutex to help with scary dangling pointer bugs.
Since you're deleting the mutex, then by definition this can only be called when there is a single thread touching LLVM. Please document this in the public header. This also means you don't have to acquire the mutex at all.

What happens when someone creates a new managed static after this? This should work. Can you reset the mutex_init_flag somehow?

In D83372#2172411, @lattner wrote:

What happens when someone creates a new managed static after this? This should work. Can you reset the mutex_init_flag somehow?

I think there may be issues beyond just this change that would prevent LLVM from functioning after llvm_shutdown is called. I wouldn't expect this change to make that suddenly work.

In terms of making the flag resettable, it would have to stop using llvm::once_flag. llvm::once_flag is an llvm implementation of std::once_flag for platforms that don't have the std version, and the std version is not resettable.

Ok, then please set the pointer to null - that will cause things to explode if someone tries to use things after llvm_shutdown. Thanks!

Second initialization is required to handle corner cases. It will not happen in real world application. There is call for ManagedStatic's function after llvm_shutdown in DynamicLibrary unit-test, this second initialization is for covering that.

Harbormaster failed remote builds in B65975: Diff 281147!Jul 28 2020, 3:15 AM

Hello,

Can anybody from reviewers look into this change?

It would be really great if this patch can be merged this month.
Maksym

Looks good, thanks!

This revision is now accepted and ready to land.Aug 4 2020, 9:17 AM

I think I can't push this change to repo by myself. Can I kindly ask somebody with access rights to push commit, to push this change to repo? I will be really grateful.

Thank you one more time.

Can anybody push this commit to repo?

Thanks,
Maksym

Having different behavior for tests V production worries me a bit.

How would people feel about this instead:

static std::recursive_mutex *getManagedStaticMutex() {
  static std::recursive_mutex m;
  return &m;
}

(& leave llvm_shutdown as-is)

No global ctor - lazily constructed. No global dtor... hmm, libstdc++ has a trivial dtor at least when I look at it one way (https://godbolt.org/z/ae6j7f) , but libc++'s is non-trivial.

Is the additional complexity worth saving this one global dtor? (assuming I'm even right that the approach I'm suggesting would address the issue under discussion in this review)

I certainly have some empathy for the consistency of avoiding all global ctors/dtors, but this one's a bit of a case of "who watches the watchmen" - can't really add another mutex to guard the initialization of this mutex. Could use atomic ops but that's tricky to get right/maintain.

We need probably opinion from other reviewers. @lattner , @beanz - what do you think about David's proposition?

I'm uploading new revision, according to suggestions above.

Hello, this change would be very useful for SwiftShader, a CPU-targeted Vulkan driver used by Chrome and Android (among other projects). This would plug memory leaks reported when unloading our driver. Can we get this reviewed/merged please?

Herald added a subscriber: dexonsmith. · View Herald TranscriptJan 26 2021, 8:18 AM

In D83372#2522877, @amaiorano wrote:

Hello, this change would be very useful for SwiftShader, a CPU-targeted Vulkan driver used by Chrome and Android (among other projects). This would plug memory leaks reported when unloading our driver. Can we get this reviewed/merged please?

What sort of situation are you dealing with/how does this issue manifest for you?

My understanding is that this would be a problem for users unloading/reloading a dynamic library (dll, so, dylib) but generally not an issue for static linking, for instance? & given, as @beanz said, llvm is unlikely to support re-use of llvm after shutdown, this doesn't seem like it'd go terribly well if you expect to be able to unload and then reload LLVM again later.

What sort of situation are you dealing with/how does this issue manifest for you?

So basically we produce a shared library (DLL, .so, .dylib) that implements the Vulkan API. The issue manifests itself from our internal test bots which, as part of their tests, load the vulkan driver, perform Vulkan calls, validate the result, and then unload the driver. Some of these bots enable Leak Sanitizer (LSAN), and upon ending the test process, reports memory leaks - including those from and loaded/unloaded shared libraries. This also shows up on Windows when using Application Verifier.

My understanding is that this would be a problem for users unloading/reloading a dynamic library (dll, so, dylib) but generally not an issue for static linking, for instance? & given, as @beanz said, llvm is unlikely to support re-use of llvm after shutdown, this doesn't seem like it'd go terribly well if you expect to be able to unload and then reload LLVM again later.

I'm not sure what the problem is with unloading and reloading LLVM. For our use-case, anyway, it seems to work fine. We use LLVM's JIT API, so maybe that has something to do with it, though I don't really understand what the problem could be in a general sense. Unloading a shared library should clean up all allocations and handled, then reloading it again should allocate and initialize everything again.

Now, having written all that, there's been some developments since I posted yesterday. Using the above latest patch, along with invoking llvm::llvm_shutdown(), worked fine on our Windows build, and no more leaks were detected by AppVerifier. Note that in order to invoke llvm::llvm_shutdown, I used the LLVM facility "llvm_shutdown_obj" class, declared as a global in one of our translation units (this class's destructor invokes llvm_shutdown()). Now, except for on Windows via DllMain, there is no hook for "unloading DLL before global objects are destroyed" on Linux or Mac (maybe also Android, and other platforms), so we create this global, which seemed to work fine. Except that it didn't work on Linux, and the reason is global object destruction order: llvm::llvm_shutdown locks the ManagedStaticMutex, so it must be called before the ManagedStaticMutex static instance is destroyed. On Windows, it happened to destroy my global llvm_shutdown_obj instance before it destroyed the ManagedStaticMutex instance (named 'm' in this patch). But on Linux, it was reversed. Classic C++ problem, where the only standard guarantee we have is that objects in the same translation unit are constructed in declaration order, and destructed in the reverse order; but for objects in different TUs, there is no guarantee.

I'm thinking this is probably the reason ManagedStaticMutex was allowed to leak in the first place: there was no way to guarantee that clients would invoke llvm::llvm_shutdown before this global got destroyed, especially when making use of the llvm_shutdown_obj RAII class.

Looking at Patch 5, the one just before this latest version, I think might work, since it relies on llvm_shutdown deleting the ManagedStaticMutex after it uses it. Of course, this also implies that this call should not be made from multiple threads anyway, so perhaps another solution is to keep the latest patch as-is, but remove the lock_guard in llvm_shutdown, and document that it must not be called from multiple threads, as @lattner proposed.

Coming back to this, in SwiftShader, I've landed this patch plus the removal of the lock on getManagedStaticMutex from within llvm::llvm_shutdown. I think this is the best approach: document that llvm_shutdown is to be called from a single thread, along with this patch (the last version, with static std::recursive_mutex m; in getManagedStaticMutex. This allows for code that uses llvm::llvm_shutdown_obj to work properly, without worrying about order of destruction, and it plugs the memory leak.

Sounds OK to me - if this patch can be updated to add a comment to llvm_shutdown to say it's only safe to call without any other threads executing LLVM APIs.

(re: questions about whether llvm_shutdown is usable - yeah, probably if you're unloading the whole dll after llvm_shutdown, that seems like a scenario we could/should support. But running llvm_shutdown, and then using LLVM APIs again in the same process/libraries - that might not go so well & might be less likely to be something that'd be feasible support with a few small patches)

Perhaps llvm_shutdown should be documented to only be used if you're about to unload the library.

Sounds OK to me - if this patch can be updated to add a comment to llvm_shutdown to say it's only safe to call without any other threads executing LLVM APIs.

... and remove the locking in said function. Sorry if that was implied in what you wrote, but it's important for this patch to work.

(re: questions about whether llvm_shutdown is usable - yeah, probably if you're unloading the whole dll after llvm_shutdown, that seems like a scenario we could/should support. But running llvm_shutdown, and then using LLVM APIs again in the same process/libraries - that might not go so well & might be less likely to be something that'd be feasible support with a few small patches)

Perhaps llvm_shutdown should be documented to only be used if you're about to unload the library.

Yes, I agree. Or at least a comment stating that after llvm_shutdown, no further LLVM calls should be made.

@mwezdeck Do you think you can update this patch with the recommendations above, namely:

Add a comment to llvm_shutdown to say it's only safe to call without any other threads executing LLVM APIs.
Remove the locking of getManagedStaticMutex() in llvm_shutdown altogether.

It would be great to land this patch with these changes.

In D83372#2581740, @amaiorano wrote:

@mwezdeck Do you think you can update this patch with the recommendations above, namely:

Add a comment to llvm_shutdown to say it's only safe to call without any other threads executing LLVM APIs.

Remove the locking of getManagedStaticMutex() in llvm_shutdown altogether.

It would be great to land this patch with these changes.

I have uploaded new version of the patch. Please, review it.

Harbormaster completed remote builds in B93806: Diff 330644.Mar 15 2021, 7:30 AM

SGTM

llvm/lib/Support/ManagedStatic.cpp
71–74	Wouldn't mind some extra clarification that this should be the last use of LLVM APIs in this loaded module (executable/dynamic library) - shutting down and then restarting use of LLVM in the same process is probably not supported/not practical to support so far as I know. (could be wrong though/open to discussion)