Page MenuHomePhabricator

parallelize calling of Module::PreloadSymbols()
ClosedPublic

Authored by llunak on Apr 2 2022, 6:06 AM.

Details

Summary

If LLDB index cache is enabled and everything is cached, then loading of debug info is essentially single-threaded, because it's done from PreloadSymbols() called from GetOrCreateModule(), which is called from a loop calling LoadModuleAtAddress() in DynamicLoaderPOSIXDYLD. Parallelizing the entire loop could be unsafe because of GetOrCreateModule() operating on a module list, so instead move only the PreloadSymbols() call to Target::ModulesDidLoad() and parallelize there, which should be safe.

This may greatly reduce the load time if the debugged program uses a large number of binaries (as opposed to monolithic programs where this presumably doesn't make a difference). In my specific case of LibreOffice Calc this reduces startup time from 6s to 2s.

Presumably a similar change could be done for other platforms (I have no plans to do so).

Diff Detail

Event Timeline

llunak created this revision.Apr 2 2022, 6:06 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 2 2022, 6:06 AM
llunak requested review of this revision.Apr 2 2022, 6:06 AM
labath requested changes to this revision.Apr 4 2022, 9:02 AM

This is an interesting area for optimisation, but I don't think it's going to be that easy. A couple of years ago, I experimented with a similar approach, but I could not come up with a satisfactory solution in a reasonable amount of time. This is how the PrefetchModuleSpecs call came into being. I'm not entirely proud of it, but it was sufficient for my use case, and it is single threaded (which is always nice).

To move forward with this we'd basically need to resolve all of the issues that you mention in the patch description:

  • threadpool-within-threadpool is a real problem. We shouldn't have n^2 threads running in the system. That will cause thrashing, possible OOMs, and just make it generally hard to figure out what's going on. I'm not sure what did you mean by the semaphore idea, but I don't think it would be sufficient to grab a semaphore in some thread before starting some work -- that'll still leave us with n^2 threads. We should prevent this many threads from being spawned in the first place.
  • the thread safety claim is extremely bold. There's a big difference between running something in one thread (it doesn't matter that it's not the applications main thread -- this is running on a thread which is dedicated to serving one particular process), and spawning a bunch of threads to do something in parallel. It took us several years to fix all the corner cases relating to the dwarf index pooling, and that code was much more self-contained than this. The Target::GetOrCreateModule function (the core of this functionality) is pretty complex, consist of multiple levels of retries and lookups, and I don't think it would produce reasonable results if it was executed concurrently. I believe that in the normal case (all of the modules are distinct) it would do the right thing, but I'm pretty sure that it could do something strange and unexpected if the list contained (e.g.) the same module twice. The obvious fix (add a lock guard the manipulation of the target module list) would most likely defeat the main purpose of this patch. So, I think we would need to implement this function differently, or at least, provide some proof that current implementation is correct.

But before you go and try to do all of that (if this wasn't enough to discourage you :P), I'd like to understand what is the precise thing that you're trying to optimise. If there is like a single hot piece of code that we want to optimise, then we might be able to come up with an different approach (a'la PrefetchModuleSpecs) to achieve the same thing.

This revision now requires changes to proceed.Apr 4 2022, 9:02 AM
llunak added a comment.Apr 4 2022, 9:45 AM

I'd like to understand what is the precise thing that you're trying to optimise. If there is like a single hot piece of code that we want to optimise, then we might be able to come up with an different approach (a'la PrefetchModuleSpecs) to achieve the same thing.

The scenario is a repeated lldb start with index cache enabled (and so everything necessary cached), the debugged app is debug build of LibreOffice Calc (~100 medium-sized C++ libraries). 90% of the CPU time is spent in LoadFromCache() calls, because 70% of the CPU time is spent in ConstString (and that's after my ConstString optimizations). I don't think there's any other way to make it faster other than parallelizing it, although parallelizing only LoadFromCache() should be enough and I'd expect that to be limited enough to be safe if you expect the lldb codebase is not generally thread-safe. In this patch I parallelized higher because it was simple to do it up there, and I don't know how to easily get at all the LoadFromCache() calls (especially the one called from SymbolFileDWARFDebugMap is relatively deep). If you know how to parallelize only that, that works for me too.

I had tried something similar with the thread pools when trying to parallelize similar stuff. The solution I made was to have a global thread pool for the entire LLDB process, but then the LLVM thread pool stuff needed to be modified to handle different groups of threads where work could be added to a queue and then users can wait on the queue. The queues then need to be managed by the thread pool code. Queues could also be serial queues or concurrent queues. I never completed the patch, but just wanted to pass along the ideas I had used. So instead of adding everything to a separate pool, the main thread pool could take queues. The code for your code above would look something like:

llvm::ThreadPool::Queue queue(llvm::ThreadPool::PoolType::Concurrent);
for (size_t i = 0; i < infos.size(); ++i)
  queue.push(load_module_fn, i);
Debugger::GetThreadPool().wait(queue);

We could have a static function on Debugger, or just make a static function inside of LLDB to grab the thread pool, and queue up the work for the individual queues. Then we can have one central location for the thread pool and anyone can throw work onto the pool with individual queues that can be waited on separately.

I had tried something similar with the thread pools when trying to parallelize similar stuff. The solution I made was to have a global thread pool for the entire LLDB process, but then the LLVM thread pool stuff needed to be modified to handle different groups of threads where work could be added to a queue and then users can wait on the queue. The queues then need to be managed by the thread pool code. Queues could also be serial queues or concurrent queues. I never completed the patch, but just wanted to pass along the ideas I had used. So instead of adding everything to a separate pool, the main thread pool could take queues. The code for your code above would look something like:

I got a similar idea too, but I think it wouldn't work here in this threadpool-within-threadpool situation. If you limit the total number of threads, then the "outer" tasks could allocate all the threads and would deadlock on "inner" tasks waiting for threads (and if you don't limit threads then there's no need to bother). That's why I came up with the semaphore idea, as that would require threads to acquire slots and "outer" tasks could temporarily release them when spawning "inner" tasks (I didn't consider pending threads to be a problem, but if it is, then the supposed ThreadPool queues presumably could do that somehow too).

But if lldb code is not considered thread-safe to parallelize the module loading at this level, then this is all irrelevant. If it's required to parallelize only cache loading, then that removes the threadpool-within-threadpool situation. I don't know if I understand the code enough to try that though.

labath added a comment.Apr 5 2022, 2:47 AM

I'd like to understand what is the precise thing that you're trying to optimise. If there is like a single hot piece of code that we want to optimise, then we might be able to come up with an different approach (a'la PrefetchModuleSpecs) to achieve the same thing.

The scenario is a repeated lldb start with index cache enabled (and so everything necessary cached), the debugged app is debug build of LibreOffice Calc (~100 medium-sized C++ libraries). 90% of the CPU time is spent in LoadFromCache() calls, because 70% of the CPU time is spent in ConstString (and that's after my ConstString optimizations). I don't think there's any other way to make it faster other than parallelizing it, although parallelizing only LoadFromCache() should be enough and I'd expect that to be limited enough to be safe if you expect the lldb codebase is not generally thread-safe. In this patch I parallelized higher because it was simple to do it up there, and I don't know how to easily get at all the LoadFromCache() calls (especially the one called from SymbolFileDWARFDebugMap is relatively deep). If you know how to parallelize only that, that works for me too.

OK, got it. So, for this case, I think the best approach would be to extract and paralelize the PreloadSymbols calls. They are not deep (=> relatively easy to extract), optional (they are just a performance optimization, so nothing will break if they're skipped), and completely isolated (they only access data from the single module).

In fact we already have a sort of an existing place to do this kind of group module actions. The reason that the ModulesDidLoad (line 621) call does not happen inside GetOrCreateModule is because we want to send just one load event instead of spamming the user with potentially hundreds of messages. I don't think it would be unreasonable to move the PreloadSymbols call from to ModulesDidLoad.

Of course, we still need to figure out the parallelization story.

I had tried something similar with the thread pools when trying to parallelize similar stuff. The solution I made was to have a global thread pool for the entire LLDB process, but then the LLVM thread pool stuff needed to be modified to handle different groups of threads where work could be added to a queue and then users can wait on the queue. The queues then need to be managed by the thread pool code. Queues could also be serial queues or concurrent queues. I never completed the patch, but just wanted to pass along the ideas I had used. So instead of adding everything to a separate pool, the main thread pool could take queues. The code for your code above would look something like:

I got a similar idea too, but I think it wouldn't work here in this threadpool-within-threadpool situation. If you limit the total number of threads, then the "outer" tasks could allocate all the threads and would deadlock on "inner" tasks waiting for threads (and if you don't limit threads then there's no need to bother). That's why I came up with the semaphore idea, as that would require threads to acquire slots and "outer" tasks could temporarily release them when spawning "inner" tasks (I didn't consider pending threads to be a problem, but if it is, then the supposed ThreadPool queues presumably could do that somehow too).

Pending/suspended threads are less of a problem then threads actively contending for cpu time, but still less than ideal. On my machine I could end up with over 2k threads. At 8MB per thread, that's 16GB just for stack (most of it probably unused, but still...). And some machines have more (sometimes a lot more) CPUs than I do.

Properly implementing thread pools is tricky, and I don't consider myself an expert, but I think that, instead of using semaphores, you could detect that the case when wait() is being called from inside a thread pool thread, and then, instead of passively waiting for the task to finish, start eagerly evaluating it on the same thread.

I'm pretty sure I didn't come up with this idea, so it's possible that something like this is already implemented in llvm, and I got the idea from there.

But if lldb code is not considered thread-safe to parallelize the module loading at this level, then this is all irrelevant. If it's required to parallelize only cache loading, then that removes the threadpool-within-threadpool situation. I don't know if I understand the code enough to try that though.

Yeah, I'd rather avoid parallelizing the logic for populating the target's module and load lists. Even if the result would be always correct, just the mere fact that the module lists (and the output of e.g. image list) would be in nondeterministic order is not ideal.

llunak added a comment.Apr 5 2022, 3:21 AM

OK, got it. So, for this case, I think the best approach would be to extract and paralelize the PreloadSymbols calls. They are not deep (=> relatively easy to extract), optional (they are just a performance optimization, so nothing will break if they're skipped), and completely isolated (they only access data from the single module).

In fact we already have a sort of an existing place to do this kind of group module actions. The reason that the ModulesDidLoad (line 621) call does not happen inside GetOrCreateModule is because we want to send just one load event instead of spamming the user with potentially hundreds of messages. I don't think it would be unreasonable to move the PreloadSymbols call from to ModulesDidLoad.

I've meanwhile had a closer look at the relevant functions and I've come to a similar conclusion (minus the ModulesDidLoad part). I think GetOrCreateModule currently does call ModulesDidLoad for each module, but that should be easy to change (and PreloadSymbols comes from 7fca8c0757a5ee5f290844376c7f8c5f3c1ffcfe , which means the function should be fine being moved there). I'll have a go at this.

Pending/suspended threads are less of a problem then threads actively contending for cpu time, but still less than ideal. On my machine I could end up with over 2k threads. At 8MB per thread, that's 16GB just for stack (most of it probably unused, but still...). And some machines have more (sometimes a lot more) CPUs than I do.

Properly implementing thread pools is tricky, and I don't consider myself an expert, but I think that, instead of using semaphores, you could detect that the case when wait() is being called from inside a thread pool thread, and then, instead of passively waiting for the task to finish, start eagerly evaluating it on the same thread.

I'm pretty sure I didn't come up with this idea, so it's possible that something like this is already implemented in llvm, and I got the idea from there.

Those 16GB would be address space, most of which wouldn't be memory. I'm not an expert on that, but that's why I didn't consider it to be a problem, e.g. sanitizers allocate way more address space. But ok, I can adjust ThreadPool to be reusable for different groups of tasks. I've thought of the processing wait idea too, ThreadPool currently can't do that, but I think it should be easy to add.

llunak updated this revision to Diff 420492.Apr 5 2022, 7:00 AM
llunak retitled this revision from parallelize module loading in DynamicLoaderPOSIXDYLD() to parallelize calling of Module::PreloadSymbols().
llunak edited the summary of this revision. (Show Details)
llunak added a reviewer: jingham.

Ok, parallelizing of only Module::PreloadSymbols() was simpler than I expected and it also works, so I've reworked the patch to do that.

This change now requires D123128 and also adding the grouping ability to ThreadPool, which I haven't done yet.

After applying this patch I started seeing data races reported by TSan when running the shell tests (check-lldb-shell). It seems to happen to different tests on different runs but the backtraces are the same.

WARNING: ThreadSanitizer: data race (pid=40880)
  Read of size 1 at 0x0001139ae0a8 by thread T3 (mutexes: write M4094, write M119908344493401136):
    #0 llvm::itanium_demangle::AbstractManglingParser<llvm::itanium_demangle::ManglingParser<(anonymous namespace)::DefaultAllocator>, (anonymous namespace)::DefaultAllocator>::parseOperatorEncoding() ItaniumDemangle.h:3033 (liblldb.15.0.0git.dylib:arm64+0x555e668)
    #1 llvm::itanium_demangle::AbstractManglingParser<llvm::itanium_demangle::ManglingParser<(anonymous namespace)::DefaultAllocator>, (anonymous namespace)::DefaultAllocator>::parseOperatorName(llvm::itanium_demangle::AbstractManglingParser<llvm::itanium_demangle::ManglingParser<(anonymous namespace)::DefaultAllocator>, (anonymous namespace)::DefaultAllocator>::NameState*) ItaniumDemangle.h:3060 (liblldb.15.0.0git.dylib:arm64+0x55628dc)
    #2 llvm::itanium_demangle::AbstractManglingParser<llvm::itanium_demangle::ManglingParser<(anonymous namespace)::DefaultAllocator>, (anonymous namespace)::DefaultAllocator>::parseUnqualifiedName(llvm::itanium_demangle::AbstractManglingParser<llvm::itanium_demangle::ManglingParser<(anonymous namespace)::DefaultAllocator>, (anonymous namespace)::DefaultAllocator>::NameState*, llvm::itanium_demangle::Node*, llvm::itanium_demangle::ModuleName*) ItaniumDemangle.h:2803 (liblldb.15.0.0git.dylib:arm64+0x5559ff0)
    #3 llvm::itanium_demangle::AbstractManglingParser<llvm::itanium_demangle::ManglingParser<(anonymous namespace)::DefaultAllocator>, (anonymous namespace)::DefaultAllocator>::parseName(llvm::itanium_demangle::AbstractManglingParser<llvm::itanium_demangle::ManglingParser<(anonymous namespace)::DefaultAllocator>, (anonymous namespace)::DefaultAllocator>::NameState*) ItaniumDemangle.h:2666 (liblldb.15.0.0git.dylib:arm64+0x5556c08)
    #4 llvm::itanium_demangle::AbstractManglingParser<llvm::itanium_demangle::ManglingParser<(anonymous namespace)::DefaultAllocator>, (anonymous namespace)::DefaultAllocator>::parseEncoding() ItaniumDemangle.h:5038 (liblldb.15.0.0git.dylib:arm64+0x555108c)
    #5 llvm::itanium_demangle::AbstractManglingParser<llvm::itanium_demangle::ManglingParser<(anonymous namespace)::DefaultAllocator>, (anonymous namespace)::DefaultAllocator>::parse() ItaniumDemangle.h:5449 (liblldb.15.0.0git.dylib:arm64+0x554b9b0)
    #6 llvm::ItaniumPartialDemangler::partialDemangle(char const*) ItaniumDemangle.cpp:385 (liblldb.15.0.0git.dylib:arm64+0x554c2a0)
    #7 lldb_private::RichManglingContext::FromItaniumName(lldb_private::ConstString) RichManglingContext.cpp:42 (liblldb.15.0.0git.dylib:arm64+0x2df084)
    #8 lldb_private::Mangled::GetRichManglingInfo(lldb_private::RichManglingContext&, bool (*)(llvm::StringRef, lldb_private::Mangled::ManglingScheme)) Mangled.cpp:217 (liblldb.15.0.0git.dylib:arm64+0x2bfec4)
    #9 lldb_private::Symtab::InitNameIndexes() Symtab.cpp:331 (liblldb.15.0.0git.dylib:arm64+0x41aaa0)
    #10 lldb_private::Symtab::PreloadSymbols() Symtab.cpp:456 (liblldb.15.0.0git.dylib:arm64+0x41ba4c)
    #11 lldb_private::Module::PreloadSymbols() Module.cpp:1378 (liblldb.15.0.0git.dylib:arm64+0x2c74d4)
    #12 std::__1::__function::__func<std::__1::__bind<lldb_private::Target::ModulesDidLoad(lldb_private::ModuleList&)::$_0&, unsigned long&>, std::__1::allocator<std::__1::__bind<lldb_private::Target::ModulesDidLoad(lldb_private::ModuleList&)::$_0&, unsigned long&> >, void ()>::operator()() function.h:352 (liblldb.15.0.0git.dylib:arm64+0x4d73e0)
    #13 std::__1::__function::__func<llvm::ThreadPool::createTaskAndFuture(std::__1::function<void ()>)::'lambda'(), std::__1::allocator<llvm::ThreadPool::createTaskAndFuture(std::__1::function<void ()>)::'lambda'()>, void ()>::operator()() function.h:352 (liblldb.15.0.0git.dylib:arm64+0x4d6920)
    #14 void* llvm::thread::ThreadProxy<std::__1::tuple<llvm::ThreadPool::grow(int)::$_0> >(void*) thread.h:60 (liblldb.15.0.0git.dylib:arm64+0x9daa0c)

  Previous write of size 1 at 0x0001139ae0a8 by thread T6 (mutexes: write M26458652249517616, write M208572962157285296):
    #0 llvm::itanium_demangle::AbstractManglingParser<llvm::itanium_demangle::ManglingParser<(anonymous namespace)::DefaultAllocator>, (anonymous namespace)::DefaultAllocator>::parseOperatorEncoding() ItaniumDemangle.h:3034 (liblldb.15.0.0git.dylib:arm64+0x555e720)
    #1 llvm::itanium_demangle::AbstractManglingParser<llvm::itanium_demangle::ManglingParser<(anonymous namespace)::DefaultAllocator>, (anonymous namespace)::DefaultAllocator>::parseOperatorName(llvm::itanium_demangle::AbstractManglingParser<llvm::itanium_demangle::ManglingParser<(anonymous namespace)::DefaultAllocator>, (anonymous namespace)::DefaultAllocator>::NameState*) ItaniumDemangle.h:3060 (liblldb.15.0.0git.dylib:arm64+0x55628dc)
    #2 llvm::itanium_demangle::AbstractManglingParser<llvm::itanium_demangle::ManglingParser<(anonymous namespace)::DefaultAllocator>, (anonymous namespace)::DefaultAllocator>::parseUnqualifiedName(llvm::itanium_demangle::AbstractManglingParser<llvm::itanium_demangle::ManglingParser<(anonymous namespace)::DefaultAllocator>, (anonymous namespace)::DefaultAllocator>::NameState*, llvm::itanium_demangle::Node*, llvm::itanium_demangle::ModuleName*) ItaniumDemangle.h:2803 (liblldb.15.0.0git.dylib:arm64+0x5559ff0)
    #3 llvm::itanium_demangle::AbstractManglingParser<llvm::itanium_demangle::ManglingParser<(anonymous namespace)::DefaultAllocator>, (anonymous namespace)::DefaultAllocator>::parseName(llvm::itanium_demangle::AbstractManglingParser<llvm::itanium_demangle::ManglingParser<(anonymous namespace)::DefaultAllocator>, (anonymous namespace)::DefaultAllocator>::NameState*) ItaniumDemangle.h:2666 (liblldb.15.0.0git.dylib:arm64+0x5556c08)
    #4 llvm::itanium_demangle::AbstractManglingParser<llvm::itanium_demangle::ManglingParser<(anonymous namespace)::DefaultAllocator>, (anonymous namespace)::DefaultAllocator>::parseEncoding() ItaniumDemangle.h:5038 (liblldb.15.0.0git.dylib:arm64+0x555108c)
    #5 llvm::itanium_demangle::AbstractManglingParser<llvm::itanium_demangle::ManglingParser<(anonymous namespace)::DefaultAllocator>, (anonymous namespace)::DefaultAllocator>::parse() ItaniumDemangle.h:5449 (liblldb.15.0.0git.dylib:arm64+0x554b9b0)
    #6 llvm::ItaniumPartialDemangler::partialDemangle(char const*) ItaniumDemangle.cpp:385 (liblldb.15.0.0git.dylib:arm64+0x554c2a0)
    #7 lldb_private::RichManglingContext::FromItaniumName(lldb_private::ConstString) RichManglingContext.cpp:42 (liblldb.15.0.0git.dylib:arm64+0x2df084)
    #8 lldb_private::Mangled::GetRichManglingInfo(lldb_private::RichManglingContext&, bool (*)(llvm::StringRef, lldb_private::Mangled::ManglingScheme)) Mangled.cpp:217 (liblldb.15.0.0git.dylib:arm64+0x2bfec4)
    #9 lldb_private::Symtab::InitNameIndexes() Symtab.cpp:331 (liblldb.15.0.0git.dylib:arm64+0x41aaa0)
    #10 lldb_private::Symtab::PreloadSymbols() Symtab.cpp:456 (liblldb.15.0.0git.dylib:arm64+0x41ba4c)
    #11 lldb_private::Module::PreloadSymbols() Module.cpp:1378 (liblldb.15.0.0git.dylib:arm64+0x2c74d4)
    #12 std::__1::__function::__func<std::__1::__bind<lldb_private::Target::ModulesDidLoad(lldb_private::ModuleList&)::$_0&, unsigned long&>, std::__1::allocator<std::__1::__bind<lldb_private::Target::ModulesDidLoad(lldb_private::ModuleList&)::$_0&, unsigned long&> >, void ()>::operator()() function.h:352 (liblldb.15.0.0git.dylib:arm64+0x4d73e0)
    #13 std::__1::__function::__func<llvm::ThreadPool::createTaskAndFuture(std::__1::function<void ()>)::'lambda'(), std::__1::allocator<llvm::ThreadPool::createTaskAndFuture(std::__1::function<void ()>)::'lambda'()>, void ()>::operator()() function.h:352 (liblldb.15.0.0git.dylib:arm64+0x4d6920)
    #14 void* llvm::thread::ThreadProxy<std::__1::tuple<llvm::ThreadPool::grow(int)::$_0> >(void*) thread.h:60 (liblldb.15.0.0git.dylib:arm64+0x9daa0c)

  Location is global 'llvm::itanium_demangle::AbstractManglingParser<llvm::itanium_demangle::ManglingParser<(anonymous namespace)::DefaultAllocator>, (anonymous namespace)::DefaultAllocator>::parseOperatorEncoding()::Done' at 0x0001139ae0a8 (liblldb.15.0.0git.dylib+0x83360a8)

  Mutex M4094 (0x00010890b130) created at:
    #0 pthread_mutex_init <null>:16654816 (libclang_rt.tsan_osx_dynamic.dylib:arm64e+0x3134c)
    #1 std::__1::recursive_mutex::recursive_mutex() <null>:16654816 (libc++.1.dylib:arm64e+0xf24c)
    #2 lldb_private::Module::Module(lldb_private::ModuleSpec const&) Module.cpp:132 (liblldb.15.0.0git.dylib:arm64+0x2c186c)
    #3 lldb_private::ModuleList::GetSharedModule(lldb_private::ModuleSpec const&, std::__1::shared_ptr<lldb_private::Module>&, lldb_private::FileSpecList const*, llvm::SmallVectorImpl<std::__1::shared_ptr<lldb_private::Module> >*, bool*, bool) ModuleList.cpp:830 (liblldb.15.0.0git.dylib:arm64+0x2ccf24)
    #4 lldb_private::PlatformDarwin::GetSharedModuleWithLocalCache(lldb_private::ModuleSpec const&, std::__1::shared_ptr<lldb_private::Module>&, lldb_private::FileSpecList const*, llvm::SmallVectorImpl<std::__1::shared_ptr<lldb_private::Module> >*, bool*) PlatformDarwin.cpp:262 (liblldb.15.0.0git.dylib:arm64+0x74c07c)
    #5 lldb_private::PlatformMacOSX::GetSharedModule(lldb_private::ModuleSpec const&, lldb_private::Process*, std::__1::shared_ptr<lldb_private::Module>&, lldb_private::FileSpecList const*, llvm::SmallVectorImpl<std::__1::shared_ptr<lldb_private::Module> >*, bool*) PlatformMacOSX.cpp:176 (liblldb.15.0.0git.dylib:arm64+0x75aaf8)
    #6 lldb_private::Target::GetOrCreateModule(lldb_private::ModuleSpec const&, bool, lldb_private::Status*) Target.cpp:2118 (liblldb.15.0.0git.dylib:arm64+0x4c6d74)
    #7 lldb_private::Target::SetExecutableModule(std::__1::shared_ptr<lldb_private::Module>&, lldb_private::LoadDependentFiles) Target.cpp:1460 (liblldb.15.0.0git.dylib:arm64+0x4c6638)
    #8 lldb_private::TargetList::CreateTargetInternal(lldb_private::Debugger&, llvm::StringRef, lldb_private::ArchSpec const&, lldb_private::LoadDependentFiles, std::__1::shared_ptr<lldb_private::Platform>&, std::__1::shared_ptr<lldb_private::Target>&) TargetList.cpp:320 (liblldb.15.0.0git.dylib:arm64+0x4e2ba8)
    #9 lldb_private::TargetList::CreateTargetInternal(lldb_private::Debugger&, llvm::StringRef, llvm::StringRef, lldb_private::LoadDependentFiles, lldb_private::OptionGroupPlatform const*, std::__1::shared_ptr<lldb_private::Target>&) TargetList.cpp:233 (liblldb.15.0.0git.dylib:arm64+0x4e1634)
    #10 lldb_private::TargetList::CreateTarget(lldb_private::Debugger&, llvm::StringRef, llvm::StringRef, lldb_private::LoadDependentFiles, lldb_private::OptionGroupPlatform const*, std::__1::shared_ptr<lldb_private::Target>&) TargetList.cpp:52 (liblldb.15.0.0git.dylib:arm64+0x4dffd4)
    #11 CommandObjectTargetCreate::DoExecute(lldb_private::Args&, lldb_private::CommandReturnObject&) CommandObjectTarget.cpp:314 (liblldb.15.0.0git.dylib:arm64+0x2764760)
    #12 lldb_private::CommandObjectParsed::Execute(char const*, lldb_private::CommandReturnObject&) CommandObject.cpp:998 (liblldb.15.0.0git.dylib:arm64+0x3b19c0)
    #13 lldb_private::CommandInterpreter::HandleCommand(char const*, lldb_private::LazyBool, lldb_private::CommandReturnObject&) CommandInterpreter.cpp:1987 (liblldb.15.0.0git.dylib:arm64+0x3a4550)
    #14 lldb_private::CommandInterpreter::IOHandlerInputComplete(lldb_private::IOHandler&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >&) CommandInterpreter.cpp:3059 (liblldb.15.0.0git.dylib:arm64+0x3a8a30)
    #15 non-virtual thunk to lldb_private::CommandInterpreter::IOHandlerInputComplete(lldb_private::IOHandler&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >&) CommandInterpreter.cpp (liblldb.15.0.0git.dylib:arm64+0x3a8ecc)
    #16 lldb_private::IOHandlerEditline::Run() IOHandler.cpp (liblldb.15.0.0git.dylib:arm64+0x2948a4)
    #17 lldb_private::Debugger::RunIOHandlers() Debugger.cpp:1009 (liblldb.15.0.0git.dylib:arm64+0x26f7ac)
    #18 lldb_private::CommandInterpreter::RunCommandInterpreter(lldb_private::CommandInterpreterRunOptions&) CommandInterpreter.cpp:3309 (liblldb.15.0.0git.dylib:arm64+0x3aa360)
    #19 lldb::SBDebugger::RunCommandInterpreter(lldb::SBCommandInterpreterRunOptions const&) SBDebugger.cpp:1224 (liblldb.15.0.0git.dylib:arm64+0x54cdc)
    #20 Driver::MainLoop() Driver.cpp:575 (lldb:arm64+0x10000878c)
    #21 main Driver.cpp:851 (lldb:arm64+0x100009aa8)

  Mutex M119908344493401136 is already destroyed.

  Mutex M26458652249517616 is already destroyed.

  Mutex M208572962157285296 is already destroyed.

  Thread T3 (tid=5516949, running) created by main thread at:
    #0 pthread_create <null>:16654816 (libclang_rt.tsan_osx_dynamic.dylib:arm64e+0x2fe20)
    #1 llvm::llvm_execute_on_thread_impl(void* (*)(void*), void*, llvm::Optional<unsigned int>) Threading.inc:78 (liblldb.15.0.0git.dylib:arm64+0xa3ebe8)
    #2 llvm::ThreadPool::grow(int) ThreadPool.cpp:37 (liblldb.15.0.0git.dylib:arm64+0x9da0c4)
    #3 std::__1::shared_future<void> llvm::ThreadPool::asyncImpl<void>(std::__1::function<void ()>) ThreadPool.h:124 (liblldb.15.0.0git.dylib:arm64+0x4d5dc4)
    #4 lldb_private::Target::ModulesDidLoad(lldb_private::ModuleList&) Target.cpp:1635 (liblldb.15.0.0git.dylib:arm64+0x4c7600)
    #5 lldb_private::Target::SetExecutableModule(std::__1::shared_ptr<lldb_private::Module>&, lldb_private::LoadDependentFiles) Target.cpp:1468 (liblldb.15.0.0git.dylib:arm64+0x4c6738)
    #6 lldb_private::TargetList::CreateTargetInternal(lldb_private::Debugger&, llvm::StringRef, lldb_private::ArchSpec const&, lldb_private::LoadDependentFiles, std::__1::shared_ptr<lldb_private::Platform>&, std::__1::shared_ptr<lldb_private::Target>&) TargetList.cpp:320 (liblldb.15.0.0git.dylib:arm64+0x4e2ba8)
    #7 lldb_private::TargetList::CreateTargetInternal(lldb_private::Debugger&, llvm::StringRef, llvm::StringRef, lldb_private::LoadDependentFiles, lldb_private::OptionGroupPlatform const*, std::__1::shared_ptr<lldb_private::Target>&) TargetList.cpp:233 (liblldb.15.0.0git.dylib:arm64+0x4e1634)
    #8 lldb_private::TargetList::CreateTarget(lldb_private::Debugger&, llvm::StringRef, llvm::StringRef, lldb_private::LoadDependentFiles, lldb_private::OptionGroupPlatform const*, std::__1::shared_ptr<lldb_private::Target>&) TargetList.cpp:52 (liblldb.15.0.0git.dylib:arm64+0x4dffd4)
    #9 CommandObjectTargetCreate::DoExecute(lldb_private::Args&, lldb_private::CommandReturnObject&) CommandObjectTarget.cpp:314 (liblldb.15.0.0git.dylib:arm64+0x2764760)
    #10 lldb_private::CommandObjectParsed::Execute(char const*, lldb_private::CommandReturnObject&) CommandObject.cpp:998 (liblldb.15.0.0git.dylib:arm64+0x3b19c0)
    #11 lldb_private::CommandInterpreter::HandleCommand(char const*, lldb_private::LazyBool, lldb_private::CommandReturnObject&) CommandInterpreter.cpp:1987 (liblldb.15.0.0git.dylib:arm64+0x3a4550)
    #12 lldb_private::CommandInterpreter::IOHandlerInputComplete(lldb_private::IOHandler&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >&) CommandInterpreter.cpp:3059 (liblldb.15.0.0git.dylib:arm64+0x3a8a30)
    #13 non-virtual thunk to lldb_private::CommandInterpreter::IOHandlerInputComplete(lldb_private::IOHandler&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >&) CommandInterpreter.cpp (liblldb.15.0.0git.dylib:arm64+0x3a8ecc)
    #14 lldb_private::IOHandlerEditline::Run() IOHandler.cpp (liblldb.15.0.0git.dylib:arm64+0x2948a4)
    #15 lldb_private::Debugger::RunIOHandlers() Debugger.cpp:1009 (liblldb.15.0.0git.dylib:arm64+0x26f7ac)
    #16 lldb_private::CommandInterpreter::RunCommandInterpreter(lldb_private::CommandInterpreterRunOptions&) CommandInterpreter.cpp:3309 (liblldb.15.0.0git.dylib:arm64+0x3aa360)
    #17 lldb::SBDebugger::RunCommandInterpreter(lldb::SBCommandInterpreterRunOptions const&) SBDebugger.cpp:1224 (liblldb.15.0.0git.dylib:arm64+0x54cdc)
    #18 Driver::MainLoop() Driver.cpp:575 (lldb:arm64+0x10000878c)
    #19 main Driver.cpp:851 (lldb:arm64+0x100009aa8)

  Thread T6 (tid=5516954, running) created by main thread at:
    #0 pthread_create <null>:16654816 (libclang_rt.tsan_osx_dynamic.dylib:arm64e+0x2fe20)
    #1 llvm::llvm_execute_on_thread_impl(void* (*)(void*), void*, llvm::Optional<unsigned int>) Threading.inc:78 (liblldb.15.0.0git.dylib:arm64+0xa3ebe8)
    #2 llvm::ThreadPool::grow(int) ThreadPool.cpp:37 (liblldb.15.0.0git.dylib:arm64+0x9d9fc8)
    #3 std::__1::shared_future<void> llvm::ThreadPool::asyncImpl<void>(std::__1::function<void ()>) ThreadPool.h:124 (liblldb.15.0.0git.dylib:arm64+0x4d5dc4)
    #4 lldb_private::Target::ModulesDidLoad(lldb_private::ModuleList&) Target.cpp:1635 (liblldb.15.0.0git.dylib:arm64+0x4c7600)
    #5 lldb_private::Target::SetExecutableModule(std::__1::shared_ptr<lldb_private::Module>&, lldb_private::LoadDependentFiles) Target.cpp:1468 (liblldb.15.0.0git.dylib:arm64+0x4c6738)
    #6 lldb_private::TargetList::CreateTargetInternal(lldb_private::Debugger&, llvm::StringRef, lldb_private::ArchSpec const&, lldb_private::LoadDependentFiles, std::__1::shared_ptr<lldb_private::Platform>&, std::__1::shared_ptr<lldb_private::Target>&) TargetList.cpp:320 (liblldb.15.0.0git.dylib:arm64+0x4e2ba8)
    #7 lldb_private::TargetList::CreateTargetInternal(lldb_private::Debugger&, llvm::StringRef, llvm::StringRef, lldb_private::LoadDependentFiles, lldb_private::OptionGroupPlatform const*, std::__1::shared_ptr<lldb_private::Target>&) TargetList.cpp:233 (liblldb.15.0.0git.dylib:arm64+0x4e1634)
    #8 lldb_private::TargetList::CreateTarget(lldb_private::Debugger&, llvm::StringRef, llvm::StringRef, lldb_private::LoadDependentFiles, lldb_private::OptionGroupPlatform const*, std::__1::shared_ptr<lldb_private::Target>&) TargetList.cpp:52 (liblldb.15.0.0git.dylib:arm64+0x4dffd4)
    #9 CommandObjectTargetCreate::DoExecute(lldb_private::Args&, lldb_private::CommandReturnObject&) CommandObjectTarget.cpp:314 (liblldb.15.0.0git.dylib:arm64+0x2764760)
    #10 lldb_private::CommandObjectParsed::Execute(char const*, lldb_private::CommandReturnObject&) CommandObject.cpp:998 (liblldb.15.0.0git.dylib:arm64+0x3b19c0)
    #11 lldb_private::CommandInterpreter::HandleCommand(char const*, lldb_private::LazyBool, lldb_private::CommandReturnObject&) CommandInterpreter.cpp:1987 (liblldb.15.0.0git.dylib:arm64+0x3a4550)
    #12 lldb_private::CommandInterpreter::IOHandlerInputComplete(lldb_private::IOHandler&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >&) CommandInterpreter.cpp:3059 (liblldb.15.0.0git.dylib:arm64+0x3a8a30)
    #13 non-virtual thunk to lldb_private::CommandInterpreter::IOHandlerInputComplete(lldb_private::IOHandler&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >&) CommandInterpreter.cpp (liblldb.15.0.0git.dylib:arm64+0x3a8ecc)
    #14 lldb_private::IOHandlerEditline::Run() IOHandler.cpp (liblldb.15.0.0git.dylib:arm64+0x2948a4)
    #15 lldb_private::Debugger::RunIOHandlers() Debugger.cpp:1009 (liblldb.15.0.0git.dylib:arm64+0x26f7ac)
    #16 lldb_private::CommandInterpreter::RunCommandInterpreter(lldb_private::CommandInterpreterRunOptions&) CommandInterpreter.cpp:3309 (liblldb.15.0.0git.dylib:arm64+0x3aa360)
    #17 lldb::SBDebugger::RunCommandInterpreter(lldb::SBCommandInterpreterRunOptions const&) SBDebugger.cpp:1224 (liblldb.15.0.0git.dylib:arm64+0x54cdc)
    #18 Driver::MainLoop() Driver.cpp:575 (lldb:arm64+0x10000878c)
    #19 main Driver.cpp:851 (lldb:arm64+0x100009aa8)

SUMMARY: ThreadSanitizer: data race ItaniumDemangle.h:3033 in llvm::itanium_demangle::AbstractManglingParser<llvm::itanium_demangle::ManglingParser<(anonymous namespace)::DefaultAllocator>, (anonymous namespace)::DefaultAllocator>::parseOperatorEncoding()

After applying this patch I started seeing data races reported by TSan when running the shell tests (check-lldb-shell). It seems to happen to different tests on different runs but the backtraces are the same.

That's some debug code using a static bool. I take it D123158 takes care of it?

llunak updated this revision to Diff 420885.Apr 6 2022, 8:32 AM

Rebased on ThreadPool groups (D123225) and adding such thread pool to LLDB (D123226).

llunak updated this revision to Diff 423247.Apr 16 2022, 10:57 AM

Adapted to API changes from D123225.

llunak edited the summary of this revision. (Show Details)May 3 2022, 9:48 PM
llunak added a comment.May 3 2022, 9:50 PM

The prerequisities fo this change have been pushed, so this one is ready.

labath accepted this revision.May 4 2022, 8:11 AM

Awesome. Thanks for your patience.

This revision is now accepted and ready to land.May 4 2022, 8:11 AM
This revision was automatically updated to reflect the committed changes.
labath added a comment.May 9 2022, 2:23 AM

I'm afraid I had to revert this, as it was causing hangs in TestMultipleDebuggers.py.

I haven't fully debugged this, but what I think is happening is this:

  • the test debug multiple (identical) inferiors in parallel
  • as a result the thread pool gets hit with many preload-symbols tasks for the same set of modules. As these tasks are taking the respective module mutexes, they cannot all make progress and a large part of them gets stuck.
  • those that to manage to make progress (take the module mutex) get to the dwarf indexing stage. while waiting for the indexing tasks to complete, they start to perform some other tasks
  • if one of those tasks is another preload-symbols task the can get stuck acquiring the module mutex. However, they will still be holding the mutex for the module that they are indexing, preventing the other threads from making progress
  • if this happens to multiple threads, they can form a loop in the module mutex waits, thereby hanging forever

I'm not entirely sure what would be the right way to fix this, but it seems that we either need to make sure the threads don't hold the mutexes while performing other tasks (not sure if that's possible), or rethink the do-tasks-while-you-wait idea.

It should be fairly easy to reproduce this by running TestMultipleDebuggers a couple of times, but if you run into problems doing that, I can provide you with some backtraces, logs, or something.

labath added a comment.May 9 2022, 2:47 AM

or rethink the do-tasks-while-you-wait idea.

I suppose one way to fix this would be to ensure that the waiting thread only picks up its own subtasks while its waiting. That would avoid these deadlocks, and also ensure that the wait is not unnecessarily delayed by a completely unrelated long-running task.